Abstract

Large Language Models (LLMs) have been increasingly adopted in oncology, for example, in structuring data from clinical notes, inferring diagnoses from free text or imaging data, and anonymizing data. Due to the rapid pace of LLM development, best practices for conducting and reporting oncological research in these applications have yet to be fully established.

We queried PubMed for oncology-related LLM research with a final cutoff of December 31, 2024, and screened 179 papers. Of these, 131 were removed based on exclusion criteria, and the remaining 48 were structured and reported here. Inclusion criteria were oncology-related research and full research articles. Structured fields included the dates of submission, acceptance, and publication; the granularity of model reporting (model family, model snapshot); reporting of key LLM model parameters; availability of source code and data; and programming language and API details.

We noted near-exponential growth of LLM-related publications in oncology, with a relatively short time from submission to publicly available publication (median 3.7 months, IQR 2.5-5.9 months). Interestingly, despite this short processing time, in 25% of cases the exact model central to the publication had been deprecated by the model service provider, or a newer version was available, by the time of publication. 35.4% of published research relied solely on a graphical user interface (GUI) of LLMs such as ChatGPT, while 37.5% reported programmatic API use, with Python as the most common language. While most publications fully or partially reported the prompts used (75%), only 22.9% reported the exact key model parameters, such as temperature. Even when the temperature parameter was reported, 45.4% of these publications used a temperature value larger than 0, resulting in more stochastic answers. Source code was made publicly available in 18.7% of publications that reported using a programming language such as Python or R. While practically all publications (97.9%) reported the model family used, such as GPT-4o, Claude 3.5 Sonnet, or Llama 3-70B, only 27% reported the exact model snapshot; GPT-4o, for example, had separate snapshots dated May 13, August 6, and November 20, 2024.

We exemplify and report shortcomings of recent LLM adoption in oncological research. To alleviate these issues, we propose a checklist, directed at researchers and journals, to improve the reproducibility, transparency, and longevity of LLM research. The preliminary checklist comprises: exact reporting of the model snapshot, with model parameters bound to that specific snapshot rather than the latest release; API usage instead of GUI chatbots; a temperature parameter equal to 0; assessment of variability across runs; session restarts to avoid carry-over biases; and caution when researching models likely to be deprecated, given the short turnaround times in the LLM field (a minimal sketch of these items follows below). Additionally, rigorous prompt engineering, and especially few-shot learning, shows potential in optimizing interactions with LLMs, also in oncology.
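As a minimal sketch of the checklist items above, assuming the OpenAI Python SDK (openai>=1.0) with an illustrative snapshot name and prompt, the following shows pinning an exact model snapshot, setting temperature to 0, starting a fresh session per query, and assessing variability across repeated runs:

```python
# Minimal sketch of the proposed checklist; the snapshot name and prompt
# below are illustrative assumptions, not taken from any surveyed study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODEL_SNAPSHOT = "gpt-4o-2024-08-06"  # exact snapshot, not the floating "gpt-4o" alias

def query(prompt: str) -> str:
    # A fresh message list per call acts as a session restart (no carry-over bias).
    response = client.chat.completions.create(
        model=MODEL_SNAPSHOT,
        temperature=0,  # minimize decoding stochasticity
        seed=42,        # best-effort reproducibility where the API supports it
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Assess variability across runs, even at temperature 0.
answers = [query("Extract the TNM stage from: <clinical note>") for _ in range(3)]
print(f"{len(set(answers))} distinct answer(s) across {len(answers)} runs")
```

Reporting the exact snapshot string, the key parameters, and such repeated-run checks alongside the prompts would address several of the reporting gaps quantified above.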
Citation Format: Tolou Shadbahr, Antti S. Rannikko, Tuomas Mirtti, Teemu D. Laajala. Current oncological large language model research lacks reproducibility, transparency, and long term support [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr B021.