Can small and reasoning large language models score journal articles for research quality and do averaging and few-shot help?
Abstract Previous research has shown that journal article quality ratings from the cloud-based Large Language Model (LLM) families ChatGPT and Gemini and the medium-sized open-weights LLM Gemma3 27b correlate moderately with expert research quality scores. This article assesses whether other medium-sized LLMs, smaller LLMs, and reasoning models have similar abilities. This is tested with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1 on a dataset of 2780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. Few-shot and score averaging approaches are also evaluated. The results suggest that medium-sized LLMs have similar performance to ChatGPT-4o mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Reasoning models did not have a clear advantage. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and there is weak evidence that few-shot prompts (four examples) tend to help. Overall, the results show, for the first time, that smaller LLMs with more than 4b parameters have a substantial capability to rate journal articles for research quality, especially if score averaging is used, but that reasoning does not give an advantage for this task; it is therefore not recommended because it is slow. The use of LLMs to support research evaluation is now more credible since multiple variants have a similar ability, including many that can be deployed offline in a secure environment without substantial computing resources.
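The score averaging strategy described in this abstract can be illustrated with a minimal Python sketch. It assumes each article is scored several times with the same prompt and that agreement with the gold standard is measured by a rank (Spearman) correlation; the abstract only says the scores "correlate", so the choice of Spearman, the function names, and the toy scores below are all illustrative assumptions rather than the authors' method or data.

```python
from statistics import mean

def average_scores(repeated_scores):
    """Mean score per article over repeated identical queries."""
    return {article: mean(scores) for article, scores in repeated_scores.items()}

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: three repeated LLM scores per article and one expert score.
llm_runs = {"a1": [2, 3, 3], "a2": [3, 3, 4], "a3": [1, 2, 2], "a4": [4, 3, 4]}
expert = {"a1": 2, "a2": 3, "a3": 1, "a4": 4}

averaged = average_scores(llm_runs)
articles = sorted(expert)
rho = spearman([averaged[a] for a in articles], [expert[a] for a in articles])
print(f"Spearman correlation of averaged scores with expert scores: {rho:.3f}")
```

Averaging turns the coarse integer scores of individual runs into finer-grained values, which is one plausible reason the ranking of articles can become more stable as more repetitions are averaged.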
- Research Article
3
- 10.2196/70703
- Apr 28, 2025
- Journal of Medical Internet Research
The aging population represents an accomplishment for society but also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including in home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. However, the advent of digital resources has introduced videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs exhibit the capacity to mimic clinical reasoning and support decision-making, their potential to serve as alternatives to evidence-based professional instruction remains unexplored. We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard. We also sought to identify specific domains where LLMs show the most promise and where improvements are necessary to optimize their reliability for caregiver training. An observational, comparative case study evaluated 3 LLMs (GPT-3.5, GPT-4o, and Microsoft Copilot) in 10 home care scenarios. A rubric assessed the models against a reference standard (gold standard) created by health care professionals. Independent reviewers evaluated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were also compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity, and analyzed differences between LLMs across all evaluated domains.
The study revealed that while no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). However, the models exhibited significant limitations, with GPT-4o and Copilot omitting relevant details in 60% (6/10) of the cases, and GPT-3.5 doing so in 80% (8/10). When compared to the gold standard, only 10% (2/20) of GPT-4o responses were rated as equally specific, 20% (4/20) included comparable practical advice, and just 5% (1/20) provided a justification as detailed as professional guidance. Furthermore, error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20%, 2/10 vs 10%, 1/10 for GPT-4o and 0%, 0/10 for GPT-3.5). LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although not yet matching professional instruction quality, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address limitations and optimize their performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors.
- Research Article
- 10.1016/j.acra.2025.12.020
- Apr 1, 2026
- Academic Radiology
Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports.
- Research Article
9
- 10.1016/j.cjca.2024.05.022
- May 31, 2024
- Canadian Journal of Cardiology
Revolutionizing Cardiology With Words: Unveiling the Impact of Large Language Models in Medical Science Writing
- Research Article
- 10.1177/20552076251349616
- May 1, 2025
- Digital Health
Objective To investigate the performance (accuracy, comprehensiveness, consistency, and the necessary information ratio) of large language models (LLMs) in providing knowledge related to respiratory aspiration, and to explore the potential of using LLMs as training tools. Methods This was a non-human-subject evaluative study. Two LLMs (GPT-3.5 and GPT-4) were asked 36 questions (32 objective questions and four subjective questions) about respiratory aspiration in English and Chinese. Responses were scored by two experts against gold standards derived from authoritative books. The accuracy of the two LLMs’ responses to objective questions was compared using the chi-square test or Fisher exact probability method. For subjective questions, the t-test or Mann–Whitney U test was used to compare the two LLMs. Results There was no significant difference in the ratings provided by the two experts. Both LLMs achieved high accuracy scores on objective questions. The LLMs also performed well on subjective questions, showing high levels of accuracy, comprehensiveness, consistency, and necessary information ratio. No significant differences were found between the two LLMs in the accuracy of the English and Chinese responses to subjective questions (z = 0.331, p = 0.886; z = 1.703, p = 0.114). There was also no significant difference between the two LLMs in the comprehensiveness of the English and Chinese responses (t = 0.787, p = 0.461; t = 1.175, p = 0.285). Conclusions LLMs demonstrated promising performance in delivering respiratory aspiration-related knowledge and showed promise as supportive tools in training, particularly when their limitations were well understood.
- Research Article
1
- 10.22452/mjlis.vol30no2.4
- Aug 30, 2025
- Malaysian Journal of Library and Information Science
Academic librarians often construct bibliometric indicators to support research evaluation. Traditionally, these have been citation-based, but AI alternatives have recently emerged. Although both Google Gemini (1.5 Flash) and ChatGPT (4o and 4o-mini) provide research quality evaluation scores that correlate positively with expert scores in nearly all fields, and more strongly than citations in most, it is not known whether this holds for smaller Large Language Models (LLMs). In response, this article assesses Google’s Gemma-3-27b-it, a downloadable LLM (60 GB). Results for 104,187 articles show that Gemma-3-27b-it scores correlate positively with an expert research quality score proxy for all 34 Units of Assessment (broad fields) from the UK Research Excellence Framework 2021. The Gemma-3-27b-it correlations have 83.8% of the strength of ChatGPT 4o and 94.7% of the strength of ChatGPT 4o-mini correlations. Unlike the two larger LLMs, the Gemma-3-27b-it correlations do not increase substantially when scores are averaged across five repetitions, its scores tend to be lower, and its reports are relatively uniform in style. Overall, the results show that research quality score estimation can be conducted by offline LLMs, so this capability is not an emergent property of only the largest LLMs. Moreover, score improvement through repetition is not a universal feature of LLMs. In conclusion, although the largest LLMs still have the highest research evaluation score estimation capability, smaller ones can also be used for this task, which can be helpful for cost saving or when secure offline processing is required.
- Research Article
16
- 10.1093/bjd/ljae377
- Oct 4, 2024
- The British Journal of Dermatology
Large language models (LLMs) have a potential role in providing adequate patient information. This study compared the quality of LLM responses with established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma. Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and, for LLMs, again after 8 months. Outcomes included (medical) accuracy, completeness, personalization, readability and, additionally, reproducibility for LLMs. Comparative analyses were performed within LLMs and PIRs using Friedman's ANOVA, and between best-performing LLMs and gold-standard (GS) PIRs using the Wilcoxon signed-rank test. Within LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009). Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLMs outperformed the GS-PIR on completeness and personalization, yet they were less accurate and less readable. Over time, response reproducibility decreased for all LLMs, showing variability across outcomes. Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.
- Research Article
- 10.1200/jco.2025.43.16_suppl.e22603
- Jun 1, 2025
- Journal of Clinical Oncology
e22603 Background: Constructing databases is crucial for answering clinical questions but is time-consuming and error-prone. Our institution has maintained a REDCap database of cancer genetics encounters since 2002, manually curated by research assistants. We explored automating some data entry using a HIPAA-compliant, commercially available large language model (LLM). Methods: We randomly selected 100 patients from our database since 2017; a board-certified oncologist reviewed each chart to establish a gold standard. We examined variable abstraction for (1) whether genetic testing was ordered, (2) whether genetic testing results were obtained, (3) whether a variant was identified and, if so, the (4) gene and (5) variant status (benign, uncertain significance, or pathogenic). For the LLM input, we provided every Epic note and letter from January 2017 to January 2025 from the Cancer Genetics group (n = 308) for the 100 patients. For patients with multiple notes, we took (1) concordant values from ≥ 2 notes or (2) a non-benign variant as the true LLM result. We made two API calls per note using Stanford Healthcare Secure GPT with OpenAI’s gpt-4o model. The code is available at https://github.com/MrJimb0/ASCO2025. We calculated summary statistics for time, token use, accuracy, and sensitivity/specificity, with the oncologist chart review as the reference. Results: The LLM accurately categorized 88% of the 100 patients compared to 87% by research assistants in REDCap. LLM errors that occurred in more than one patient were due to information being outside of the provided notes (n = 4), information being in an image never converted to text (n = 2), and a familial variant being incorrectly interpreted as the patient’s own (n = 2). In contrast, errors in REDCap were due to new results returning after the date the research assistant did data entry (n = 7) and typos (n = 5). Of the cohort, 29% had a pathogenic variant.
The LLM had a sensitivity of 83% and specificity of 96% for pathogenic variant detection, compared to 76% and 100% for REDCap. The LLM processed an average of 9,801 input tokens and 372 output tokens per patient, processing each patient in approximately 24 seconds. For a research assistant, the average time was 6 minutes per patient. Assuming 2,500 patients in a year, typical for this clinic, the LLM would take 16.5 hours of work at around $72 compared to 250 hours, or $7,500 of effort, for a research assistant. Conclusions: Compared to abstraction by a research assistant, the LLM was quicker and had similar sensitivity and specificity for these five variables. We obtained these results without hyperparameter tuning, vectorization, note standardization, model retraining, or the development of a foundational model. These results suggest that commercial LLMs with limited prompt engineering and post-LLM processing can support chart review in cancer genetics, potentially reducing costs and improving the efficiency of database construction.
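The workload comparison in this abstract is simple arithmetic, and a short Python sketch makes it easy to rerun with other inputs. The function name and the derived hourly rate are illustrative assumptions; the inputs (2,500 patients per year, about 24 seconds per patient for the LLM, 6 minutes per patient for a research assistant, $7,500 of effort) come from the abstract. Note that the rounded 24-second figure gives about 16.7 hours, close to the 16.5 hours reported, which was presumably computed from an unrounded per-patient time.

```python
def annual_hours(patients_per_year, seconds_per_patient):
    """Total processing time per year, in hours."""
    return patients_per_year * seconds_per_patient / 3600

PATIENTS_PER_YEAR = 2500

# LLM: approximately 24 seconds per patient (rounded figure from the abstract).
llm_hours = annual_hours(PATIENTS_PER_YEAR, 24)

# Research assistant: 6 minutes per patient.
assistant_hours = annual_hours(PATIENTS_PER_YEAR, 6 * 60)

# Hourly rate implied by the abstract's $7,500 for the assistant's hours
# (a derived value, not one stated in the abstract).
implied_hourly_rate = 7500 / assistant_hours

print(f"LLM: {llm_hours:.1f} h, assistant: {assistant_hours:.0f} h, "
      f"implied rate: ${implied_hourly_rate:.0f}/h")
```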
- Research Article
- 10.1177/20552076261430065
- Feb 1, 2026
- Digital Health
The increasing use of large language models (LLMs) for manuscript preparation and content generation presents both opportunities and risks, creating an urgent need for clear guidance. While many journals have introduced directives, their consistency and scope remain unclear. This study aimed to assess the prevalence and nature of LLM use guidance in emergency medicine publishing. We conducted a cross-sectional analysis of emergency medicine journals, reviewing websites for directives on LLM use by authors and on the use of AI in the peer review process by editors and reviewers. Data were extracted on guidance existence, stakeholder requirements, publisher adoption, and association with journal metrics. Of the 56 journals, 38 (68%) provided a directive on LLM use. While all 38 (100%) permitted LLM use for writing, guidance for authors on image generation was conflicting: 32% permitted it, while 40% explicitly prohibited it. Directives for editors were similarly contradictory, with 24% prohibiting LLM use and one (3%) permitting it. For reviewers, 47% prohibited LLM use, while one (3%) permitted it. Publisher-driven fragmentation was profound, with adoption rates ranging from 18% to 100%. Notably, no statistically significant association was detected between the presence of a directive and journal quality metrics (P > .05). Emergency medicine publishing demonstrates significant variations and conflicting guidance in its governance of LLM use. Existing directives present contradictory rules for authors, editors, and reviewers on key issues like image generation and use in peer review. To close this critical guidance gap, a comprehensive, standardized framework is urgently needed to resolve these conflicts and foster the responsible integration of digital technologies into scholarly publishing.
- Research Article
5
- 10.1007/s10869-025-10035-6
- Jun 21, 2025
- Journal of Business and Psychology
Researchers are increasingly exploring the use of large language models (LLMs) to develop materials for surveys and experiments. However, clear guidance on effective implementation remains limited. In this paper, we propose a decision-making framework comprising five use cases for integrating large language models into psychological survey and experimental methods: (1) LLM as research assistant; (2) LLM as adaptive content creator; (3) LLM as external resource; (4) LLM as conversation partner; and (5) LLM as research confederate. To support these applications, we introduce the open-source Qualtrics-AI Link (QUAIL), software designed to integrate content generated by ChatGPT’s LLM foundation model into the Qualtrics platform. Across contexts, and for all scenarios involving the use of LLMs in research material creation, we provide guidance on the technical steps necessary to support both internal and external validity. These include effective prompt engineering, model selection, alpha and beta testing, launching, and monitoring. We conclude with a discussion of relevant ethical considerations, cautions, and resources for auditing validity claims. Throughout, we emphasize that good research design and adherence to ethical principles should guide decision-making, and that researcher expertise in both LLMs and research design is essential to ensure valid participant interactions when using LLM-based tools.
- Research Article
- 10.3760/cma.j.cn112139-20250814-00402
- Feb 1, 2026
- Zhonghua wai ke za zhi [Chinese journal of surgery]
Objective: To explore the performance of large language models (LLMs) in diagnosing clinically significant prostate cancer (csPCa), and the improvement in the diagnostic performance of an open-source LLM after low-rank adaptation (LoRA) fine-tuning. Methods: This is a retrospective case series study. Data were collected from 1,077 patients who underwent ultrasound-guided systematic prostate biopsy at the Department of Urology, Peking University Third Hospital from January 2018 to December 2024. Patients were aged (M (IQR)) 69 (13) years (range: 38 to 90 years), and 391 were in the gray zone (prostate-specific antigen 4 to 10 μg/L). The collected data included patients' clinical characteristics, prostate MRI reports, and biopsy histopathological results. Four LLMs (GPT 4.1, DeepSeek R1, Qwen3-235B-A22B, Qwen3-32B) were used to diagnose csPCa based on patient information, and the performance of the LLMs was evaluated using biopsy histopathological results as the gold standard. Subsequently, the data from the 1,077 patients were divided into training and test sets at an 8:2 ratio, and LoRA fine-tuning was performed on Qwen3-32B. The fine-tuned model was named PCD-Qwen3, and its diagnostic efficacy in the test set was evaluated. The receiver operating characteristic curve was plotted and the area under the curve (AUC) and 95% CI were calculated to evaluate the diagnostic performance of the LLMs. The DeLong test was used to compare the differences in AUC between groups. Results: Among all patients, DeepSeek R1 had the highest AUC for diagnosing csPCa at 0.848 (95% CI: 0.826 to 0.871), with statistically significant differences compared to Qwen3-235B-A22B (0.827 (95% CI: 0.803 to 0.851)) and Qwen3-32B (0.753 (95% CI: 0.724 to 0.781)) (Z=2.34, P=0.020; Z=7.35, P<0.01), but no difference compared to GPT 4.1 (0.842 (95% CI: 0.819 to 0.865)) (P>0.05). The accuracy, sensitivity, and specificity of DeepSeek R1 for diagnosing csPCa were 77.3%, 70.2%, and 84.1%, respectively.
In the gray zone patient population with total prostate-specific antigen of 4 to 10 μg/L, DeepSeek R1 had an AUC of 0.765 (95% CI: 0.715 to 0.816) for diagnosing csPCa. Using DeepSeek R1 to diagnose gray zone patients could avoid 46.3% (181/391) of unnecessary biopsies while missing 5.9% (23/391) of csPCa patients. Except for Qwen3-32B, the PI-RADS scores evaluated by the other three LLMs achieved moderate agreement with those of radiologists. After LoRA fine-tuning, the diagnostic performance of PCD-Qwen3 was significantly improved compared to Qwen3-32B. In the test set of 216 patients, the accuracy, sensitivity, specificity, and AUC were 77.3%, 75.5%, 79.1%, and 0.831 (95% CI: 0.776 to 0.885), respectively, comparable to the performance of DeepSeek R1 (all P>0.05). Conclusions: Among the four LLMs, DeepSeek R1 had the best performance in diagnosing csPCa. After LoRA fine-tuning, PCD-Qwen3 achieved performance comparable to DeepSeek R1. LLMs demonstrated promising application value in diagnosing csPCa.
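Accuracy, sensitivity, and specificity figures such as those in this abstract are derived from a 2x2 table of model predictions against the biopsy gold standard. The Python sketch below shows the standard definitions. The confusion-table counts are invented purely for illustration and are not the study's data; only the two gray zone fractions (181/391 and 23/391) are taken from the abstract.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity (as percentages) from 2x2 counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": 100 * (tp + tn) / total,
        "sensitivity": 100 * tp / (tp + fn),  # true positives among diseased
        "specificity": 100 * tn / (tn + fp),  # true negatives among non-diseased
    }

# Hypothetical counts, NOT taken from the study.
example = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
print(example)

# Proportions stated in the abstract for the 391 gray zone patients.
biopsies_avoided_pct = round(100 * 181 / 391, 1)
cspca_missed_pct = round(100 * 23 / 391, 1)
print(biopsies_avoided_pct, cspca_missed_pct)
```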
- Research Article
- 10.1161/circ.152.suppl_3.4363153
- Nov 4, 2025
- Circulation
Background: Preventive cardiology relies on a comprehensive view of patient health, including biomarkers and imaging findings. However, critical data, such as coronary calcium scores (CCS) and measures from CTA Fractional Flow Reserve (CT-FFR), often reside in unstructured free text within the EHR, making them difficult to access and use effectively in clinical decision-making. Objective: We developed and validated large language models (LLMs) to extract novel biomarkers from unstructured cardiovascular data and integrate them into a clinician-friendly interface to facilitate clinical decision-making. Methods: We compared a natural language processing (NLP) technique to LLMs for extracting total and vessel-specific CCS and vessel-specific CT-FFR from free-text reports. We validated the measures through chart review, the gold standard. Through 6 iterations of prompt engineering and 2 different LLMs, we achieved 100% accuracy for both total CCS and CT-FFR. We applied the final prompt and LLM to all CTA and CT-FFR reports available for patients seen by the preventive cardiology program. The discrete values were displayed in the preventive care dashboard to provide the clinical team with a comprehensive view for better management. Results: Among 255 CTA reports we extracted from 12/01/2023-11/22/2024, traditional NLP could only extract total CCS from 137 (54%) reports, while the LLM was able to extract CCS from all reports. Among 40 randomly selected CTA reports, 32 were coronary calcium score reports, and the LLM identified all clinical measures (i.e., vessel-specific CCS and total CCS) with 100% accuracy. Among the remaining 8 CT-FFR reports, only 2 reports had at least one vessel blockage ratio reported, and the LLM correctly captured all values.
Applying the final LLM to 498 patients who were referred to the Preventive Cardiology Institute during 12/06/2023-5/20/2025, we identified 560 total CCS reports with an average total CCS of 147 (std=352), of which 24% had an elevated CCS (i.e., 100+) and 38.2% had a score of 0. Among 138 CT-FFR reports, 21 (15.2%) had at least one vessel blockage ratio below 0.8, indicating elevated risk for stroke or heart attack. Conclusion: Leveraging LLMs to unlock valuable cardiovascular data long hidden in EHR free text has the potential to integrate structured and unstructured data and provide comprehensive clinical information to clinicians, facilitating proactive, data-driven care.
- Research Article
13
- 10.1016/j.jpainsymman.2024.11.016
- Mar 1, 2025
- Journal of Pain and Symptom Management
Large language models to identify advance care planning in patients with advanced cancer
- Research Article
15
- 10.7759/cureus.81871
- Apr 8, 2025
- Cureus
Background Previous research has highlighted the potential of large language models (LLMs) in answering multiple-choice questions (MCQs) in medical physiology. However, their accuracy and reliability in specialized fields, such as blood physiology, remain underexplored. This study evaluates the performance of six free-to-use LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in solving item-analyzed MCQs on blood physiology. The findings aim to assess their suitability as educational aids. Methods This cross-sectional study at the All India Institute of Medical Sciences, Raebareli, India, involved administering a 40-item MCQ test on blood physiology to 75 first-year medical students. Item analysis utilized the Difficulty Index (DIF I), Discrimination Index (DI), and Distractor Effectiveness (DE). Internal consistency was assessed with the Kuder-Richardson 20 (KR-20) coefficient. These 40 item-analyzed MCQs were presented to the six selected LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Le Chat), available as standalone Android applications, on March 19, 2025. Three independent users accessed each LLM simultaneously, uploading the compiled MCQs in a Portable Document Format (PDF) file. Accuracy was determined as the percentage of correct responses averaged across all three users. Reliability was measured as the percentage of MCQs that an LLM answered correctly for all three users. Descriptive statistics were presented as mean ± standard deviation and percentages. Pearson's correlation coefficient or Spearman's rho was used to evaluate the associations between variables, with p < 0.05 considered significant. Results Item analysis confirmed the validity and reliability of the assessment tool, with a DIF I of 63.2 ± 20.4, a DI of 0.38 ± 0.20, a DE of 66.7 ± 33.3, and a KR-20 of 0.804.
Among LLMs, Claude 3.7 demonstrated the highest reliable accuracy (95%), followed by DeepSeek (93%), Grok 3 beta (93%), ChatGPT (90%), Gemini 2.0 (88%), and Mistral Le Chat (70%). No significant correlations were found between LLM performance and MCQ difficulty, discrimination power, or distractor effectiveness. Conclusions The MCQ assessment tool exhibited an appropriate difficulty level, strong discriminatory power, and adequately constructed distractors. LLMs, particularly Claude, DeepSeek, and Grok, demonstrated high accuracy and reliability in solving blood physiology MCQs, supporting their role as supplementary educational tools. LLMs handled questions of varying difficulty, discrimination power, and distractor effectiveness with similar competence. However, given occasional errors, they should be used alongside traditional teaching methods and expert supervision.
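The two outcome measures defined in this abstract's Methods (accuracy as the percentage of correct responses averaged across the three users, and reliability as the percentage of MCQs answered correctly for all three users) can be sketched in a few lines of Python. The function names, the four-item answer key, and the responses are hypothetical and serve only to show how the two measures differ.

```python
def accuracy(runs, key):
    """Percentage of correct answers, averaged over all runs (users)."""
    per_run = [100 * sum(a == k for a, k in zip(run, key)) / len(key) for run in runs]
    return sum(per_run) / len(per_run)

def reliability(runs, key):
    """Percentage of items answered correctly in every run."""
    consistent = sum(all(run[i] == key[i] for run in runs) for i in range(len(key)))
    return 100 * consistent / len(key)

# Hypothetical 4-item answer key and one LLM's responses to three users.
answer_key = ["A", "B", "C", "D"]
responses = [
    ["A", "B", "C", "D"],  # user 1: all correct
    ["A", "B", "C", "A"],  # user 2: item 4 wrong
    ["A", "C", "C", "D"],  # user 3: item 2 wrong
]

print(f"accuracy: {accuracy(responses, answer_key):.1f}%")
print(f"reliability: {reliability(responses, answer_key):.1f}%")
```

Reliability can never exceed accuracy under these definitions, because an item only counts as reliable when every user received the correct answer.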
- Research Article
27
- 10.1016/j.omtn.2024.102255
- Jun 15, 2024
- Molecular Therapy - Nucleic Acids
Large language model to multimodal large language model: A journey to shape the biological macromolecules to biological sciences and medicine
- Research Article
- 10.52783/cana.v32.3965
- Feb 21, 2025
- Communications on Applied Nonlinear Analysis
Introduction: The increasing adoption of Large Language Models (LLMs) in healthcare necessitates a comprehensive review of their applications, limitations, and potential. Existing literature lacks a systematic assessment of LLM performance across diverse healthcare tasks and does not adequately address critical aspects such as model-specific optimizations, domain adaptability, and real-world deployment constraints. Objectives: This paper aims to fill the identified gaps by conducting an extensive and structured review of current research on LLM applications in medical reports, diagnostics, and decision-making. It seeks to classify and evaluate studies based on methods used, performance measures, key takeaways, strengths, and limitations. Methods: A PRISMA-based methodology was employed to systematically categorize studies according to their approaches and outcomes. The analysis focused on multiple LLMs, including GPT-3, GPT-4, BERT variants, Med-PaLM, and domain-specific adaptations such as BioGPT and COMCARE. For vision-language transformer-based auto-report generation, PEGASUS and ETB MII were examined. Additionally, the study explored KELLM for causal reasoning with knowledge graphs and OpenMedLM for equitable healthcare solutions. The selected models were evaluated based on key performance metrics such as accuracy, sensitivity, and explainability. Results: The findings indicate that specific LLMs show significant promise in enhancing healthcare applications. Models like Med-PaLM and BioGPT demonstrate improved diagnostic accuracy, while vision-language transformers such as PEGASUS enhance automated medical report generation. The integration of knowledge graphs in KELLM ensures greater interpretability and safety. Open-source models like OpenMedLM contribute to equitable access to AI-driven healthcare solutions. Overall, LLMs can reduce clinician workload, enhance diagnostic precision, and optimize healthcare workflows.
Conclusion: This study highlights the transformative potential of LLMs in medicine while also addressing challenges such as ethical considerations, energy efficiency, and scalability. By providing a systematic evaluation, this review paves the way for future advancements in AI-driven healthcare applications, fostering innovation and improved patient care.