MedFrenchmark, a Small Set for Benchmarking Generative LLMs in Medical French.

Amandine Quercia, Jamil Zaghir, Christian Lovis, Christophe Gaudet-Blavignac

doi:10.3233/shti240486

Abstract

Generative Large Language Models (LLMs) have become ubiquitous in many fields, including healthcare and medicine. Consequently, there is growing interest in leveraging LLMs for medical applications, with new models emerging daily. However, evaluation and benchmarking frameworks for LLMs are scarce, particularly those tailored to medical French. To address this gap, we introduce a minimal benchmark consisting of 114 open questions designed to assess the medical capabilities of LLMs in French. The proposed benchmark encompasses a wide range of medical domains, reflecting the complexity of real-world clinical scenarios. A preliminary validation involved testing seven widely used LLMs, each with 7 billion parameters. Results revealed significant variability in performance, emphasizing the importance of rigorous evaluation before deploying LLMs in medical settings. In conclusion, we present a novel and valuable resource for rapidly evaluating LLMs in medical French. By promoting greater accountability and standardization, this benchmark has the potential to enhance the trustworthiness and utility of LLMs in medical applications.

Journal: Studies in health technology and informatics
Publication Date: Aug 22, 2024
License type: CC BY-NC

