Abstract

The Intercollegiate Membership of the Royal College of Surgeons (MRCS) Part A examination assesses generic surgical sciences and applied knowledge using 300 multiple-choice Single Best Answer items. Large Language Models (LLMs) are trained on vast amounts of text to generate natural-language outputs, and their applications in healthcare and medical education are growing. Two LLMs, ChatGPT (OpenAI) and Bard (Google AI), were tested on 300 questions from a popular MRCS Part A question bank, posed without and with a requirement to justify the answer (NJ and J encodings, respectively). LLM outputs were scored according to accuracy, concordance and insight. ChatGPT achieved 85.7%/84.3% accuracy for the NJ/J encodings; Bard achieved 64%/64.3%. ChatGPT and Bard displayed high concordance for both the NJ (95.3% and 81.7%, respectively) and J (93.7% and 79.7%, respectively) encodings. ChatGPT and Bard provided an insightful statement in >98% and >86% of outputs, respectively. This study demonstrates that ChatGPT achieves passing-level accuracy at MRCS Part A, and that both LLMs achieve high concordance and provide insightful responses to test questions. Instances of clinically inappropriate or inaccurate decision-making, incomplete appreciation of nuanced clinical scenarios and utilisation of out-of-date guidance were, however, noted. LLMs are accessible, time-efficient tools that draw on vast clinical knowledge and may reduce the emphasis on factual recall in medical education and assessment. ChatGPT achieves passing-level accuracy for MRCS Part A with concordant and insightful outputs. Future applications of LLMs in healthcare must be cautious of hallucinations and incorrect reasoning but have the potential to develop AI-supported clinicians.
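The abstract does not describe the prompting harness used to pose the NJ and J encodings. The following is a minimal sketch of how a single-best-answer item might be submitted to a chat-style LLM under both encodings, assuming the official `openai` Python client; the model name, prompt wording and `ask` helper are illustrative assumptions, not the authors' protocol.

```python
# Hypothetical sketch of the NJ/J prompt encodings described in the abstract.
# Assumes the official `openai` Python client (openai>=1.0); the model name
# and prompt wording are illustrative, not the study's actual harness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "A 45-year-old man presents with ...\n"  # placeholder question stem
    "A. Option one\nB. Option two\nC. Option three\n"
    "D. Option four\nE. Option five"
)

def ask(question: str, justify: bool) -> str:
    """Pose a Single Best Answer item, with (J) or without (NJ) justification."""
    instruction = (
        "Answer with the single best option (A-E) and justify your choice."
        if justify
        else "Answer with the single best option (A-E) only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

nj_output = ask(QUESTION, justify=False)  # NJ encoding
j_output = ask(QUESTION, justify=True)    # J encoding
```

Under this kind of harness, accuracy is the proportion of items where the chosen option matches the key, and concordance and insight would be judged from the free-text portion of the J outputs.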
