Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study.

Ryan St Huang, Kevin Jia Qi Lu, Christopher Meaney, Joel Kemppainen, Angela Punnett, Fok-Han Leung

doi:10.2196/50514

https://doi.org/10.2196/50514

Journal: JMIR medical education	Publication Date: Sep 19, 2023
Citations: 25	License type: cc-by

Abstract

Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have advanced to the point of performing excellently on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs with that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools.

This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents on a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident.

An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was entered into GPT-3.5 and GPT-4. The artificial intelligence chatbots' responses were manually reviewed to determine the selected answer, response length, response time, provision of a rationale for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against that of a cohort of Family Medicine residents who concurrently attempted the test.

GPT-4 performed significantly better than GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it correctly answered 89/108 (82.4%) questions, while GPT-3.5 answered 62/108 (57.4%) questions correctly. Further, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. In 86.1% (n=93) of its responses, GPT-4 provided a rationale for why other multiple-choice options were not chosen, compared with 16.7% (n=18) for GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common, while arithmetic errors were the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001).

GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its response choice, ruling out other answer choices efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities support its potential applications in medical education, including the creation of exam questions and scenarios as well as serving as a resource for medical knowledge or information on community services.
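
The reported percentages can be rechecked from the counts given in the abstract, and the paired comparison can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' analysis: the abstract does not report the discordant pair counts (questions that only one chatbot answered correctly), nor whether an exact or chi-square form of the McNemar test was used. The function names below are hypothetical, and the exact binomial form is an assumption chosen for simplicity. Only the net difference in discordant pairs (89 - 62 = 27) is implied by the reported totals, so the sketch enumerates every split consistent with them rather than assuming one.

```python
from math import comb

# Counts reported in the abstract (108 multiple-choice questions).
N_QUESTIONS = 108
GPT4_CORRECT = 89
GPT35_CORRECT = 62
GPT4_RATIONALE = 93
GPT35_RATIONALE = 18


def pct(count: int, total: int = N_QUESTIONS) -> float:
    """Percentage rounded to one decimal place, as in the abstract."""
    return round(100 * count / total, 1)


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact (binomial) McNemar p-value.

    b = questions only the first model answered correctly,
    c = questions only the second model answered correctly.
    Assumption: the abstract does not say which McNemar variant was used.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def feasible_discordant_pairs():
    """All (b, c) splits consistent with the reported totals.

    b and c themselves are NOT reported; only b - c is implied.
    """
    net = GPT4_CORRECT - GPT35_CORRECT        # b - c = 27
    gpt4_wrong = N_QUESTIONS - GPT4_CORRECT   # c cannot exceed 19
    return [(net + c, c) for c in range(gpt4_wrong + 1)]


if __name__ == "__main__":
    print(pct(GPT4_CORRECT), pct(GPT35_CORRECT))      # 82.4 57.4
    print(pct(GPT4_CORRECT - GPT35_CORRECT))          # 25.0
    print(pct(GPT4_RATIONALE), pct(GPT35_RATIONALE))  # 86.1 16.7
    for b, c in feasible_discordant_pairs():
        print(b, c, mcnemar_exact_p(b, c))
```

The loop prints a p-value for each hypothetical split; which split matches the study's data cannot be determined from the abstract alone.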

Similar Papers

647 Evaluating Awareness and Knowledge of the Canadian Cardiovascular Society's Guidelines for the Diagnosis and Management of Heart Failure Among Residents in General Internal and Family Medicine: A Pilot Study
J.P Akerman ... C Pullen
Canadian Journal of Cardiology | VOL. 28
01 Sep 2012

Are university-based residency training programs lacking in resident education of proper diagnosis and treatment for common skin and breast lesions?
Stephanie M Cohen ... Mark S Cohen
The American Journal of Surgery | VOL. 204
01 Dec 2012

Impact of early waves of the COVID-19 pandemic on family medicine residency training: Analysis of survey data.
Laura Diamond ... Milena Forte
Canadian family physician Medecin de famille canadien | VOL. 69
01 Apr 2023

Sleeping at Home: A New Model for a Hospital Teaching Service
Deborah R Erlich ... Allen F Shaughnessy
Journal of Graduate Medical Education | VOL. 3
01 Jun 2011
