Autonomous medical evaluation for guideline adherence of large language models

Dennis Fast, Lisa C Adams, Felix Busch, Conor Fallon, Marc Huppertz, Robert Siepmann, Philipp Prucker, Nadine Bayerl, Daniel Truhn, Marcus Makowski, Alexander Löser, Keno K Bressem

doi:10.1038/s41746-024-01356-6

Abstract

Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It provides an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. The benchmark comprises 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.

Journal: npj Digital Medicine
Publication Date: Dec 12, 2024
License type: CC BY