Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology

Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich

doi:10.1080/00016489.2024.2352843

Abstract

Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.

Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).

Material and methods: Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The answers were compared with validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.

Results: LLMs' answers ranked inferior to consultants' answers in all categories. Yet the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for consultants.

Conclusions and significance: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
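The match rates quoted in the abstract follow directly from the reported counts. The short Python sketch below recomputes them as rounded percentages. The counts are taken from the abstract; the names `matched`, `percent`, and `match_rates` are illustrative only and do not come from the paper, and rounding to the nearest integer is an assumption about how the percentages were derived.

```python
# Counts (matching answers, total answers) as reported in the abstract.
matched = {
    "Consultants": (228, 246),
    "ChatGPT 4": (35, 41),
    "Claude 2": (32, 41),
    "Bard 2023.07.13": (24, 41),
}


def percent(numerator: int, denominator: int) -> int:
    """Return numerator/denominator as a percentage rounded to the nearest integer.

    Assumption: the abstract's percentages are simple rounded proportions.
    """
    return round(100 * numerator / denominator)


# Hypothetical helper mapping: group name -> rounded match rate in percent.
match_rates = {name: percent(n, d) for name, (n, d) in matched.items()}

for name, rate in match_rates.items():
    n, d = matched[name]
    print(f"{name}: {rate}% ({n}/{d})")
```

Under that rounding assumption, the recomputed values agree with the percentages stated for the validated-solution comparison.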

Journal: Acta Oto-Laryngologica	Publication Date: Mar 1, 2024
Citations: 1	License type: CC BY 4.0
