Large language models in pathology: A comparative study of ChatGPT and Bard with pathology trainees on multiple-choice questions

Wei Du, Xueting Jin, Jaryse Carol Harris, Alessandro Brunetti, Erika Johnson, Olivia Leung, Xingchen Li, Selemon Walle, Qing Yu, Xiao Zhou, Fang Bian, Kajanna Mckenzie, Manita Kanathanavanich, Yusuf Ozcelik, Farah El-Sharkawy, Shunsuke Koga

doi:10.1016/j.anndiagpath.2024.152392

Abstract

Large language models (LLMs), such as ChatGPT and Bard, have shown potential in various medical applications. This study aimed to evaluate the performance of two LLMs, ChatGPT and Bard, in pathology by comparing their performance with that of pathology trainees, and to assess the consistency of their responses. We selected 150 multiple-choice questions from 15 subspecialties, excluding those with images. Both ChatGPT and Bard were tested on these questions in three separate sessions between June 2023 and January 2024, and their responses were compared with those of 16 pathology trainees (8 junior and 8 senior) from two hospitals. Questions were categorized as easy, intermediate, or difficult based on trainee performance. Consistency and variability in LLM responses were analyzed across the three evaluation sessions. ChatGPT significantly outperformed Bard and the trainees, achieving an average total score of 82.2% compared to Bard's 49.5%, junior trainees' 45.1%, and senior trainees' 56.0%. ChatGPT's performance was notably stronger on difficult questions (63.4%–68.3%) than that of Bard (31.7%–34.1%) and the trainees (4.9%–48.8%). On easy questions, ChatGPT (83.1%–91.5%) and the trainees (73.7%–100.0%) achieved similarly high scores. Consistency analysis revealed that ChatGPT had a high consistency rate of 80%–85% across the three tests, whereas Bard exhibited greater variability, with consistency rates of 54%–61%. While LLMs show significant promise in pathology education and practice, continued development and human oversight are crucial for reliable clinical application.
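The abstract does not state how the consistency rate was computed, so the following is only a minimal sketch of one plausible definition: the fraction of questions for which a model gave the same answer in every session. The function names `consistency_rate` and `accuracy`, the answer key, and the toy responses are all hypothetical illustrations and are not taken from the study.

```python
def consistency_rate(sessions):
    """Fraction of questions answered identically in every session.

    `sessions` is a list of answer lists, one list per test session.
    This definition is an assumption; the study may have used another.
    """
    if not sessions or any(len(s) != len(sessions[0]) for s in sessions):
        raise ValueError("sessions must be non-empty and of equal length")
    n_questions = len(sessions[0])
    if n_questions == 0:
        raise ValueError("sessions must contain at least one question")
    # zip(*sessions) groups the answers given to each question across sessions.
    same = sum(1 for answers in zip(*sessions) if len(set(answers)) == 1)
    return same / n_questions


def accuracy(answers, key):
    """Fraction of answers that match the answer key."""
    if len(answers) != len(key) or not key:
        raise ValueError("answers and key must be non-empty and of equal length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


# Hypothetical toy data: five questions, three sessions.
key = ["A", "C", "B", "D", "A"]
sessions = [
    ["A", "C", "B", "A", "A"],
    ["A", "C", "D", "A", "A"],
    ["A", "C", "B", "A", "B"],
]

rate = consistency_rate(sessions)
scores = [accuracy(s, key) for s in sessions]
print(f"consistency: {rate:.0%}")
print("per-session accuracy:", [f"{s:.0%}" for s in scores])
```

Note that under this definition a model can be consistent and still wrong: in the toy data the fourth question receives the same incorrect answer in all three sessions, which is why consistency and accuracy are reported separately.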

Journal: Annals of Diagnostic Pathology

