Performance of a large language model in lymphoma stage assignment based on written PET/CT reports.

Similar Papers
  • Research Article
  • Cited by 53
  • 10.1001/jamanetworkopen.2023.46721
Performance of Large Language Models on a Neurology Board–Style Examination
  • Dec 7, 2023
  • JAMA network open
  • Marc Cicero Schubert + 2 more

Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs have shown heterogeneous results across specialized medical board examinations, their performance on neurology board examinations remains unexplored. The objective was to assess the performance of LLMs on neurology board-style examinations. This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. The main outcome was the overall percentage score of each LLM. LLM 2 significantly outperformed LLM 1, correctly answering 1662 of 1956 questions (85.0%) vs 1306 (66.8%) for LLM 1. Notably, LLM 2's performance exceeded the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.
Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.

  • Research Article
  • Cited by 7
  • 10.1093/mr/road115
Large language model may assist diagnosis of SAPHO syndrome by bone scintigraphy.
  • Dec 28, 2023
  • Modern rheumatology
  • Yu Mori + 4 more

In this study, we employed a large language model to evaluate the diagnostic efficacy of radiology reports of bone scintigraphy in the context of identifying SAPHO syndrome, and further examined the potential of such a model to augment the diagnostic procedure. Imaging data and clinical information of 151 patients (105/46 women/men, mean age: 53.5 years) who underwent bone scintigraphy for suspected Synovitis, Acne, Pustulosis, Hyperostosis, and Osteitis (SAPHO) syndrome between January 2007 and December 2022 were retrospectively reviewed. ChatGPT-4.0 was used as the large language model. The diagnostic performance of the large language model was verified by comparing the cases judged to have SAPHO syndrome that fulfilled Kahn's classification criteria based on a combination of concise radiology reports and skin lesions such as palmoplantar pustulosis, with cases diagnosed with SAPHO syndrome by rheumatologists based on all clinical information. The large language model's analysis of bone scintigraphy radiology reports in conjunction with information about skin symptoms, such as palmoplantar pustulosis, achieved a sensitivity of 83.5%, specificity of 69.4%, and an overall accuracy of 76.8%. This research indicates the prospective value of large language models in analyzing radiology reports from bone scintigraphy for the diagnosis of SAPHO syndrome.
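
The sensitivity, specificity, and accuracy reported above follow directly from a 2×2 confusion matrix. A minimal sketch in Python; the counts below are chosen to be consistent with the reported rates and are illustrative only, not necessarily the study's actual table:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Counts consistent with the reported 83.5% / 69.4% / 76.8% over 151 patients
# (hypothetical, inferred from the percentages, not the published table):
sens, spec, acc = diagnostic_metrics(tp=66, fp=22, tn=50, fn=13)
```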

  • Research Article
  • 10.1007/s00276-025-03723-8
Evaluation of the performance of different large language models on head and neck anatomy questions in the dentistry specialization exam in Turkey.
  • Sep 22, 2025
  • Surgical and radiologic anatomy : SRA
  • Busra Nur Gokkurt Yilmaz + 2 more

The aim of this study was to assess the performance of various Large Language Models (LLMs) in addressing head and neck anatomy questions from the Dental Specialization Exam (DUS), conducted between 2012 and 2021. A total of 103 multiple-choice questions were selected from DUS examinations over a decade. These questions covered major topics: Musculoskeletal System, Nervous System and Sensory Organs, Dental Anatomy, and Veins, Arteries, Lymphatic System and the Glandular System. Eight LLMs (Gemini 1.5, Gemini 2, Copilot, Deepseek, Claude, ChatGPT 4o, ChatGPT 4, and ChatGPT o1) were employed, each in its most updated version. Each model's accuracy was calculated by comparing the number of correct and incorrect responses. ChatGPT o1 demonstrated the highest accuracy rate among all tested models, while Gemini 1.5 showed the lowest accuracy. These differences were found to be statistically significant (p = 0.027). Post-hoc analysis revealed that the only statistically significant difference among the LLMs was between ChatGPT o1 and Gemini 1.5 (p < 0.0031). When questions were analyzed by topic, no significant accuracy differences emerged in the Musculoskeletal System section. However, ChatGPT o1 performed best in the Nervous System and Sensory Organs category. For Dental Anatomy questions, both ChatGPT o1 and Copilot achieved top results, and for the Veins, Arteries, Lymphatic System and Glandular System section, ChatGPT o1 again excelled. Overall, the findings show that LLMs effectively answer DUS head and neck anatomy questions with comparable performance. These insights support future exam-related model development and suggest that LLMs can serve as valuable educational tools.

  • Research Article
  • Cited by 3
  • 10.1186/s12859-025-06081-9
Comparative Assessment of Protein Large Language Models for Enzyme Commission Number Prediction
  • Feb 27, 2025
  • BMC Bioinformatics
  • João Capela + 5 more

Background: Protein large language models (LLM) have been used to extract representations of enzyme sequences to predict their function, which is encoded by enzyme commission (EC) numbers. However, a comprehensive comparison of different LLMs for this task is still lacking, leaving questions about their relative performance. Moreover, protein sequence alignments (e.g. BLASTp or DIAMOND) are often combined with machine learning models to assign EC numbers from homologous enzymes, thus compensating for the shortcomings of these models' predictions. In this context, LLMs and sequence alignment methods have not been extensively compared as individual predictors, raising unaddressed questions about LLMs' performance and limitations relative to the alignment methods. In this study, we set out to assess the performance of ESM2, ESM1b, and ProtBERT language models in their ability to predict EC numbers, comparing them with BLASTp, against each other and against models that rely on one-hot encodings of amino acid sequences. Results: Our findings reveal that combining these LLMs with fully connected neural networks surpasses the performance of deep learning models that rely on one-hot encodings. Moreover, although BLASTp provided marginally better results overall, DL models provide results that complement BLASTp's, revealing that LLMs better predict certain EC numbers while BLASTp excels in predicting others. ESM2 stood out as the best model among the LLMs tested, providing more accurate predictions on difficult annotation tasks and for enzymes without homologs. Conclusions: Crucially, this study demonstrates that LLMs still have to be improved to become the gold standard tool over BLASTp in mainstream enzyme annotation routines. On the other hand, LLMs can provide good predictions for more difficult-to-annotate enzymes, particularly when the identity between the query sequence and the reference database falls below 25%. Our results reinforce the claim that BLASTp and LLM models complement each other and can be more effective when used together.
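
EC numbers are hierarchical four-part codes (class.subclass.sub-subclass.serial), so predictions of the kind compared above are often scored by the depth to which they agree with the reference annotation. A small illustrative helper (not from the paper's code):

```python
def ec_match_depth(pred: str, true: str) -> int:
    """Depth (0-4) to which a predicted EC number agrees with the reference,
    comparing the hierarchical fields left to right."""
    depth = 0
    for p, t in zip(pred.split("."), true.split(".")):
        if p != t:
            break
        depth += 1
    return depth

# e.g. an alcohol dehydrogenase (EC 1.1.1.1) predicted as the closely related
# EC 1.1.1.2 agrees on the first three levels:
depth = ec_match_depth("1.1.1.1", "1.1.1.2")
```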

  • Research Article
  • 10.20965/jdr.2025.p0386
Autonomous Epidemic and Geographic Disaster Mapping: Assessing the Performance of Large Language Models in Spatial Information Integration
  • Jun 1, 2025
  • Journal of Disaster Research
  • Wan-Chih Lin + 1 more

This study aims to evaluate the performance of various large language models (LLMs) in generating dengue fever epidemic and earthquake intensity maps through the integration of spatial information technology. By combining natural language processing techniques, this paper presents an innovative method to extract real-time data related to dengue fever and earthquake events, which is then used to generate corresponding geographic information maps, thereby improving real-time monitoring and disaster management efficiency. The research designed a series of detailed prompts, including topic descriptions, data sources, analysis objectives, and specific requirements, to test the capabilities of multiple LLMs in the code generation process. The codes generated by these models were further used to map the geographic distribution of dengue fever outbreaks and earthquake intensities in Taiwan. Subsequently, the codes were evaluated on accuracy, operational efficiency, and the clarity of the visualized results. The findings revealed that in addition to ChatGPT, models such as Copilot, Claude, and Nxcode-CQ-7B-orpo also excelled at generating precise and efficient maps. These LLMs are capable of automating the processing of large amounts of data and generating visualized charts with decision support functions, significantly reducing the time and labor costs associated with traditional manual operations. In addition, this innovative approach provides a new technical pathway for real-time geographic disaster monitoring and management. The results underscore the value of integrating LLMs with spatial information technology, offering new research directions for geographic information systems applications and providing robust technical support for disaster response and public health management.

  • Research Article
  • Cited by 3
  • 10.1097/spv.0000000000001545
Comparative Analysis of Performance of Large Language Models in Urogynecology.
  • Jun 27, 2024
  • Urogynecology (Philadelphia, Pa.)
  • Ghanshyam S Yadav + 4 more

Despite growing popularity in medicine, data on large language models in urogynecology are lacking. The aim of this study was to compare the performance of ChatGPT-3.5, GPT-4, and Bard on the American Urogynecologic Society self-assessment examination. The examination features 185 questions with a passing score of 80. We tested 3 models (ChatGPT-3.5, GPT-4, and Bard) on every question. Dedicated accounts enabled controlled comparisons. Questions with prompts were inputted into each model's interface, and responses were evaluated for correctness, logical reasoning behind answer choice, and sourcing. Data on subcategory, question type, correctness rate, question difficulty, and reference quality were noted. The Fisher exact or χ2 test was used for statistical analysis. Out of 185 questions, GPT-4 answered 61.6% correctly compared with 54.6% for GPT-3.5 and 42.7% for Bard. GPT-4 answered all questions, whereas GPT-3.5 and Bard declined to answer 4 and 25 questions, respectively. All models demonstrated logical reasoning in their correct responses. Performance of all large language models was inversely proportional to the difficulty level of the questions. Bard referenced sources 97.5% of the time, more often than GPT-4 (83.3%) and GPT-3.5 (39%). GPT-3.5 cited books and websites, whereas GPT-4 and Bard additionally cited journal articles and society guidelines. Median journal impact factor and number of citations were 3.6 with 20 citations for GPT-4 and 2.6 with 25 citations for Bard. Although GPT-4 outperformed GPT-3.5 and Bard, none of the models achieved a passing score. Clinicians should use language models cautiously in patient care scenarios until more evidence emerges.
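
The Fisher exact test used in comparisons like this one operates on a 2×2 table of correct/incorrect counts per model. A self-contained sketch of the two-sided test; the example counts are inferred from the reported percentages (61.6% and 42.7% of 185 questions) and ignore declined questions, so they are illustrative rather than the paper's data:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    summing all tables as extreme as or more extreme than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        # Hypergeometric probability of the table whose top-left cell is x.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# GPT-4 vs Bard correct/incorrect counts inferred from the reported rates:
p = fisher_exact_two_sided(114, 71, 79, 106)
```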

  • Research Article
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Research Article
  • Cited by 4
  • 10.3389/fdsfr.2024.1379260
Assessing the performance of large language models in literature screening for pharmacovigilance: a comparative study
  • Jun 27, 2024
  • Frontiers in Drug Safety and Regulation
  • Dan Li + 7 more

Pharmacovigilance plays a crucial role in ensuring the safety of pharmaceutical products. It involves the systematic monitoring of adverse events and the detection of potential safety concerns related to drugs. Manual literature screening for pharmacovigilance related articles is a labor-intensive and time-consuming task, requiring streamlined solutions to cope with the continuous growth of literature. The primary objective of this study is to assess the performance of Large Language Models (LLMs) in automating literature screening for pharmacovigilance, aiming to enhance the process by identifying relevant articles more effectively. This study represents a novel application of LLMs including OpenAI’s GPT-3.5, GPT-4, and Anthropic’s Claude2, in the field of pharmacovigilance, evaluating their ability to categorize medical publications as relevant or irrelevant for safety signal reviews. Our analysis encompassed N-shot learning, chain-of-thought reasoning, and evaluating metrics, with a focus on factors impacting accuracy. The findings highlight the promising potential of LLMs in literature screening, achieving a reproducibility of 93%, sensitivity of 97%, and specificity of 67% showcasing notable strengths in terms of reproducibility and sensitivity, although with moderate specificity. Notably, performance improved when models were provided examples consisting of abstracts, labels, and corresponding reasoning explanations. Moreover, our exploration identified several potential contributing factors influencing prediction outcomes. These factors encompassed the choice of key words and prompts, the balance of the examples, and variations in reasoning explanations. By configuring advanced LLMs for efficient screening of extensive literature databases, this study underscores the transformative potential of these models in drug safety monitoring. 
Furthermore, the insights gained from this study can inform the development of automated systems for pharmacovigilance, contributing to the ongoing efforts to ensure the safety and efficacy of pharmaceutical products.
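
The N-shot setup described above — examples consisting of abstracts, labels, and corresponding reasoning explanations, followed by the abstract to classify — amounts to simple prompt assembly. A minimal sketch; the wording and field names are illustrative, not the study's actual prompts:

```python
def build_screening_prompt(examples, abstract):
    """Assemble an N-shot relevance-classification prompt from labeled
    examples with reasoning explanations (illustrative wording)."""
    parts = ["Classify each abstract as RELEVANT or IRRELEVANT for a "
             "drug-safety signal review."]
    for ex in examples:
        parts.append("Abstract: {abstract}\nLabel: {label}\nReasoning: {reasoning}"
                     .format(**ex))
    parts.append(f"Abstract: {abstract}\nLabel:")
    return "\n\n".join(parts)

# One-shot example (hypothetical texts):
demo = build_screening_prompt(
    [{"abstract": "Case report of drug-induced liver injury...",
      "label": "RELEVANT",
      "reasoning": "Describes an adverse event attributed to a drug."}],
    "A pharmacokinetic model of renal clearance...")
```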

  • Research Article
  • Cited by 37
  • 10.1148/ryai.230364
Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports.
  • May 8, 2024
  • Radiology. Artificial intelligence
  • Bastien Le Guellec + 9 more

Purpose To assess the performance of a local open-source large language model (LLM) in various information extraction tasks from real-life emergency brain MRI reports. Materials and Methods All consecutive emergency brain MRI reports written in 2022 from a French quaternary center were retrospectively reviewed. Two radiologists identified MRI scans that were performed in the emergency department for headaches. Four radiologists scored the reports' conclusions as either normal or abnormal. Abnormalities were labeled as either headache-causing or incidental. Vicuna (LMSYS Org), an open-source LLM, performed the same tasks. Vicuna's performance metrics were evaluated using the radiologists' consensus as the reference standard. Results Among the 2398 reports during the study period, radiologists identified 595 that included headaches in the indication (median age of patients, 35 years [IQR, 26-51 years]; 68% [403 of 595] women). A positive finding was reported in 227 of 595 (38%) cases, 136 of which could explain the headache. The LLM had a sensitivity of 98.0% (95% CI: 96.5, 99.0) and specificity of 99.3% (95% CI: 98.8, 99.7) for detecting the presence of headache in the clinical context, a sensitivity of 99.4% (95% CI: 98.3, 99.9) and specificity of 98.6% (95% CI: 92.2, 100.0) for the use of contrast medium injection, a sensitivity of 96.0% (95% CI: 92.5, 98.2) and specificity of 98.9% (95% CI: 97.2, 99.7) for study categorization as either normal or abnormal, and a sensitivity of 88.2% (95% CI: 81.6, 93.1) and specificity of 73% (95% CI: 62, 81) for causal inference between MRI findings and headache. Conclusion An open-source LLM was able to extract information from free-text radiology reports with excellent accuracy without requiring further training. Keywords: Large Language Model (LLM), Generative Pretrained Transformers (GPT), Open Source, Information Extraction, Report, Brain, MRI Supplemental material is available for this article. 
Published under a CC BY 4.0 license. See also the commentary by Akinci D'Antonoli and Bluethgen in this issue.
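
The 95% CIs quoted for each sensitivity and specificity above can be computed with a standard score interval for a binomial proportion. A minimal sketch; the Wilson interval shown here is one common choice (the paper may have used a different method), and the counts are inferred from the reported percentages rather than taken from the paper:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# ~98.0% sensitivity for headache detection over 595 reports
# (counts inferred from the reported rate, not the paper's data):
lo, hi = wilson_ci(583, 595)
```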

  • Research Article
  • Cited by 13
  • 10.1111/1742-6723.14280
Will code one day run a code? Performance of language models on ACEM primary examinations and implications.
  • Jul 6, 2023
  • Emergency Medicine Australasia
  • Jesse Smith + 2 more

Large language models (LLMs) have demonstrated mixed results in their ability to pass various specialist medical examinations, and their performance within the field of emergency medicine remains unknown. We explored the performance of three prevalent LLMs (OpenAI's GPT series, Google's Bard, and Microsoft's Bing Chat) on a practice ACEM primary examination. All LLMs achieved a passing score, with GPT-4.0 outperforming the average candidate. Large language models, by passing the ACEM primary examination, show potential as tools for medical education and practice. However, limitations exist and are discussed.

  • Research Article
  • Cited by 13
  • 10.3390/diagnostics14141491
Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems.
  • Jul 11, 2024
  • Diagnostics (Basel, Switzerland)
  • Syed Ali Haider + 6 more

Medical researchers are increasingly utilizing advanced LLMs like ChatGPT-4 and Gemini to enhance diagnostic processes in the medical field. This research focuses on their ability to comprehend and apply complex medical classification systems for breast conditions, which can significantly aid plastic surgeons in making informed decisions for diagnosis and treatment, ultimately leading to improved patient outcomes. Fifty clinical scenarios were created to evaluate the classification accuracy of each LLM across five established breast-related classification systems. Scores from 0 to 2 were assigned to LLM responses to denote incorrect, partially correct, or completely correct classifications. Descriptive statistics were employed to compare the performances of ChatGPT-4 and Gemini. Gemini exhibited superior overall performance, achieving 98% accuracy compared to ChatGPT-4's 71%. While both models performed well in the Baker classification for capsular contracture and UTSW classification for gynecomastia, Gemini consistently outperformed ChatGPT-4 in other systems, such as the Fischer Grade Classification for gender-affirming mastectomy, Kajava Classification for ectopic breast tissue, and Regnault Classification for breast ptosis. With further development, integrating LLMs into plastic surgery practice will likely enhance diagnostic support and decision making.

  • Research Article
  • Cited by 6
  • 10.18653/v1/2020.eval4nlp-1.13
Are Some Words Worth More than Others?
  • Jan 1, 2020
  • Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing
  • Shiran Dudy + 1 more

Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of the prediction attempts will occur with frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull generated text being produced by a downstream consumer of a language model. To address this, we propose two new intrinsic evaluation measures within the framework of a simple word prediction task that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
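
The core observation above — that aggregate accuracy can hide divergent behavior on high- vs low-frequency words — can be illustrated by stratifying word-prediction accuracy by the corpus frequency of the reference token. This is a sketch of the general idea with a toy cutoff, not the paper's proposed metrics:

```python
from collections import Counter

def stratified_accuracy(predictions, references, corpus, cutoff=2):
    """Word-prediction accuracy split into high- and low-frequency bands,
    based on corpus counts of the reference token (illustrative cutoff)."""
    counts = Counter(corpus)
    band_totals = {"high": [0, 0], "low": [0, 0]}  # band -> [correct, total]
    for pred, ref in zip(predictions, references):
        band = "high" if counts[ref] >= cutoff else "low"
        band_totals[band][0] += int(pred == ref)
        band_totals[band][1] += 1
    return {b: c / t for b, (c, t) in band_totals.items() if t}

# Toy data: the model nails the frequent word but misses the rare ones,
# a failure mode that overall accuracy (0.5 here) would obscure:
acc = stratified_accuracy(
    predictions=["the", "the", "cat", "the"],
    references=["the", "the", "dog", "zygote"],
    corpus=["the", "the", "the", "dog", "zygote"])
```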

  • Research Article
  • 10.1609/aaai.v39i24.34761
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Zhen Ye + 11 more

Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
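
The semantic reconstruction loss after RVQ can be thought of as penalizing the distance between pre-quantization semantic features and their reconstructions. A framework-free sketch using mean cosine distance; X-Codec's actual loss formulation may differ:

```python
def semantic_reconstruction_loss(features, reconstructions):
    """Mean cosine distance between semantic feature vectors and their
    post-RVQ reconstructions (illustrative; not the paper's exact loss)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
        return dot / (norm + 1e-8)
    return sum(1.0 - cosine(u, v)
               for u, v in zip(features, reconstructions)) / len(features)

# A perfect reconstruction yields a loss of ~0:
loss = semantic_reconstruction_loss([[1.0, 0.0], [0.5, 0.5]],
                                    [[1.0, 0.0], [0.5, 0.5]])
```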

  • Research Article
  • 10.1680/jstbu.24.00139
Evaluating performance of large language models on fundamental structural knowledge
  • Oct 21, 2025
  • Proceedings of the Institution of Civil Engineers - Structures and Buildings
  • Yi Zhang + 1 more

This paper evaluates the performance of multimodal large language models (MLLMs), ChatGPT-4o and Gemini, in structural engineering with an image-input approach for the first time in the literature. The study uses 90 real visual-based fundamental mechanics questions that span five subtopics to examine the accuracy of ChatGPT and Gemini and compare them to the responses of students. The results show that the students perform better than the two MLLMs overall, although ChatGPT-4o closely matches or exceeds student performance on kinematics and kinetics topics, whereas Gemini performs worst. However, the study also reveals the limitations of LLMs in accurately extracting mathematical and mechanical information from structural images, as well as their inadequacy in calculations and in ensuring logical relationships. Evaluating the capabilities of MLLMs shows immense potential for structural engineering applications, while fine-tuning for specific tasks and developing artificial intelligence agents with MLLMs as the core represent critical directions for future research.

  • Research Article
  • 10.1007/s10792-025-03587-2
Performance of a novel multimodal large language model in interpreting meibomian glands quantitatively and qualitatively.
  • May 28, 2025
  • International ophthalmology
  • Pelin Kiyat + 1 more

To evaluate the performance of a multimodal large language model (LLM), Claude 3.5 Sonnet, in interpreting meibography images for Meibomian gland dropout grading and morphological abnormality detection. A total of 228 meibography images were analyzed by the same researcher and an assessment was performed in terms of gland dropout ratio and morphological abnormalities. Meibomian gland loss was graded from 0 (no loss) to 3 (> 2/3 loss of total gland area). One hundred and sixty images, comprising 40 images per grade, were included. Claude 3.5 Sonnet, a multimodal LLM developed by Anthropic (California, United States), was utilized to investigate its performance in evaluating meibography images. Claude 3.5 Sonnet showed high performance in grading Meibomian gland dropout, correctly scoring 97.5%, 92.5%, 95%, and 85% of images in Grades 0, 1, 2, and 3, respectively. In addition, Claude 3.5 Sonnet showed remarkable performance in detecting morphological abnormalities, including heterogeneous lumen diameters, lumen tortuosity, shortened lumen length, and hyperreflective gland residues. The model detected all of the 48 manually identified morphological abnormalities accurately. In 12 images initially classified as morphologically normal by the manual assessment, the model reported additional subtle abnormalities. Claude 3.5 Sonnet showed promising results in interpreting meibography images, detecting morphological abnormalities and discriminating normal Meibomian glands from abnormal ones. Claude 3.5 Sonnet might be useful as a complementary educational tool in ophthalmology clinics. The model's ability to perform detailed morphological evaluations and respond to further questions provides a tailored learning experience for young ophthalmic clinicians.
