Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools.

Justin T Reese, Leonardo Chimirri, Daniel Danis, J Harry Caufield, Kyran Wissink, Elena Casiraghi, Giorgio Valentini, Melissa A Haendel, Christopher J Mungall, Peter N Robinson

doi:10.1101/2024.07.22.24310816

Abstract

Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology and Mondo disease ontology. Prompts generated from each phenopacket were sent to three generative pretrained transformer (GPT) models. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
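The headline figures above (23.6% versus 35.5%) are top-1 accuracies: the share of cases in which the correct diagnosis is ranked first. A minimal sketch of how such a metric could be computed is shown below, assuming each tool's output has already been mapped to a ranked list of disease identifiers per case. The function name, the data layout, and the identifiers are hypothetical placeholders for illustration, not the authors' evaluation code or real Mondo terms.

```python
from typing import Dict, List


def top_k_accuracy(
    ranked: Dict[str, List[str]],
    truth: Dict[str, str],
    k: int = 1,
) -> float:
    """Fraction of cases whose correct diagnosis is among the first k candidates.

    ranked: case ID -> candidate disease IDs, best candidate first (assumed layout).
    truth:  case ID -> the single correct disease ID for that case.
    Cases with no ranked output count as misses.
    """
    if not truth:
        raise ValueError("no cases to score")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(
        1
        for case_id, correct in truth.items()
        if correct in ranked.get(case_id, [])[:k]
    )
    return hits / len(truth)


# Toy data with placeholder identifiers (not real ontology terms).
truth = {
    "case1": "DISEASE:A",
    "case2": "DISEASE:B",
    "case3": "DISEASE:C",
    "case4": "DISEASE:D",
}
ranked = {
    "case1": ["DISEASE:A", "DISEASE:X"],               # correct at rank 1
    "case2": ["DISEASE:X", "DISEASE:B", "DISEASE:Y"],  # correct at rank 2
    "case3": ["DISEASE:X", "DISEASE:Y"],               # correct answer absent
    # case4 has no output at all and is scored as a miss
}

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
print(f"top-1: {top1:.3f}, top-3: {top3:.3f}")
```

Scoring every case in the truth set, including those with no usable answer, keeps the denominator identical across tools, so the resulting percentages are directly comparable.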

Journal: medRxiv : the preprint server for health sciences
Publication Date: Nov 7, 2024
License type: CC BY 4.0
