Abstract

Introduction
Large Language Models (LLMs), such as GPT, are artificial intelligence models designed to analyse vast amounts of data and generate coherent outputs, changing the way healthcare professionals access knowledge. Critically, LLMs can also present incorrect information confidently, a phenomenon known as hallucination, which is particularly dangerous in the safety-critical field of medicine. The validity of LLM responses to medical queries is being explored in general terms; however, responses to surgical questions remain poorly quantified. Identifying variation between specialties is important to support strategic LLM improvement.

Methods
We assessed the accuracy of GPT-3 and GPT-4 in answering surgical multiple-choice questions from the MedMCQA post-graduate question bank. We calculated the percentage accuracy of GPT-4 on all surgical questions (n=23025) and compared this to published GPT-4 performance across the whole MedMCQA dataset. We also analysed variation in performance by topic on a randomised sample of questions manually sorted by surgical specialty (n=1000).

Results
Accuracy rates for GPT-3 and GPT-4 were 53% and 64% respectively, demonstrating the significant superiority of GPT-4; however, GPT-4's surgical performance remained weaker than its overall MedMCQA performance. Notably, accuracy varied significantly by specialty, with strong performances in anatomy, vascular and paediatric surgery but below-average performances in orthopaedics, ENT and neurosurgery.

Conclusion
This study holds significant implications for the expanding use of LLMs in surgery, especially in education. GPT's accuracy has improved with successive versions; however, its performance requires further scrutiny. We recommend ongoing attention to the factors underpinning subject-level variation in performance to aid strategic LLM innovation.
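
The evaluation described in the Methods amounts to a simple loop: pose each multiple-choice question to the model, compare its answer with the keyed option, and report the percentage correct. The sketch below illustrates that kind of pipeline; it is a hypothetical reconstruction, not the authors' code. The field names ("question", "options", "answer_index"), the prompt wording, and the model identifier are illustrative assumptions.

```python
# Minimal sketch of an MCQ accuracy evaluation loop (hypothetical, not the study's pipeline).
# Assumes the openai>=1.0 Python client and questions supplied as dicts with
# "question", "options" (list of four strings) and "answer_index" (0-3) fields.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LETTERS = ["A", "B", "C", "D"]

def ask_model(question: str, options: list[str], model: str = "gpt-4") -> str:
    """Ask the model one multiple-choice question and return its single-letter answer."""
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in zip(LETTERS, options))
        + "\nAnswer with a single letter (A, B, C or D)."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()[:1].upper()

def accuracy(questions: list[dict], model: str = "gpt-4") -> float:
    """Percentage of questions for which the model's letter matches the keyed answer."""
    correct = sum(
        ask_model(q["question"], q["options"], model) == LETTERS[q["answer_index"]]
        for q in questions
    )
    return 100.0 * correct / len(questions)
```

Running the same loop per specialty subset (rather than over the pooled questions) would yield the topic-level breakdown reported in the Results.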

