Background: Large language models (LLMs) for medical applications use unknown amounts of energy, which contribute to the overall carbon footprint of the health care system.

Purpose: To investigate the tradeoffs between accuracy and energy use when using different LLM types and sizes for medical applications.

Materials and Methods: This retrospective study evaluated five different billion (B)-parameter sizes of two open-source LLMs (Meta's Llama 2, a general-purpose model, and LMSYS Org's Vicuna 1.5, a specialized fine-tuned model) using chest radiograph reports from the National Library of Medicine's Indiana University Chest X-ray Collection. Reports with missing demographic information and missing or blank files were excluded. Models were run on local compute clusters with visual computing graphics processing units. A single-task prompt explained clinical terminology and instructed each model to confirm the presence or absence of each of the 13 CheXpert disease labels. Energy use (in kilowatt-hours) was measured using an open-source tool. Accuracy was assessed against the 13 CheXpert reference standard labels for diagnostic findings on chest radiographs, with overall accuracy defined as the mean of the individual accuracies for all 13 labels. Efficiency ratios (accuracy per kilowatt-hour) were calculated for each model type and size.

Results: A total of 3665 chest radiograph reports were evaluated. The Vicuna 1.5 7B and 13B models had higher efficiency ratios (737.28 and 331.40, respectively) and higher overall labeling accuracy (93.83% [3438.69 of 3665 reports] and 93.65% [3432.38 of 3665 reports], respectively) than the Llama 2 models (7B: efficiency ratio of 13.39, accuracy of 7.91% [289.76 of 3665 reports]; 13B: efficiency ratio of 40.90, accuracy of 74.08% [2715.15 of 3665 reports]; 70B: efficiency ratio of 22.30, accuracy of 92.70% [3397.38 of 3665 reports]). Vicuna 1.5 7B had the highest efficiency ratio (737.28 vs 13.39 for Llama 2 7B).
The larger Llama 2 70B model used more than seven times the energy of its 7B counterpart (4.16 kWh vs 0.59 kWh) without surpassing the accuracy of the Vicuna 1.5 models, resulting in an efficiency ratio of only 22.30.

Conclusion: Smaller fine-tuned LLMs were more sustainable than larger general-purpose LLMs, using less energy without compromising accuracy, highlighting the importance of LLM selection for medical applications.

© RSNA, 2024

Supplemental material is available for this article.
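The efficiency ratio described in Materials and Methods (accuracy per kilowatt-hour) can be checked against the two energy figures reported in Results. The sketch below is illustrative only and is not the authors' code. It assumes the ratio is overall accuracy in percentage points divided by energy in kilowatt-hours, which is inferred from the reported numbers. The function names and the `REPORTED` dictionary are hypothetical. The results come out near 13.4 and 22.3, differing slightly from the published 13.39 and 22.30, presumably because the abstract reports rounded inputs.

```python
from statistics import mean


def efficiency_ratio(accuracy_pct: float, energy_kwh: float) -> float:
    """Accuracy (percentage points) per kilowatt-hour.

    Assumption: this is how the abstract's 'accuracy per kilowatt-hour'
    is computed; it is inferred from the reported values.
    """
    if energy_kwh <= 0:
        raise ValueError("energy_kwh must be positive")
    return accuracy_pct / energy_kwh


def overall_accuracy(per_label_accuracy_pct: list[float]) -> float:
    """Overall accuracy as the mean of per-label accuracies (13 labels in the study)."""
    return mean(per_label_accuracy_pct)


# (overall accuracy %, energy in kWh) for the two models whose energy use
# is stated in the abstract. Other models' energy values are not reported.
REPORTED = {
    "Llama 2 7B": (7.91, 0.59),
    "Llama 2 70B": (92.70, 4.16),
}

ratios = {
    name: efficiency_ratio(acc, kwh) for name, (acc, kwh) in REPORTED.items()
}

# Energy of the 70B model relative to the 7B model ("more than seven times").
energy_multiple = REPORTED["Llama 2 70B"][1] / REPORTED["Llama 2 7B"][1]

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: efficiency ratio = {ratio:.2f}")
    print(f"70B vs 7B energy multiple = {energy_multiple:.2f}")
```

Dividing by energy rewards a model only when added accuracy outpaces added energy cost, which is why the 70B model's high accuracy still yields a low ratio.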