Abstract Disclosure: J. Tarkoff: None. A.G. Martinez Sanchez: None.

Large language models (LLMs) hold substantial promise for improving physician knowledge and expertise. Their role in medical education and in generating differential diagnoses could be crucial, with potential benefits for clinical outcomes. ChatGPT has previously been evaluated on New England Journal of Medicine pediatric case challenges [1], yet its performance in specialized areas such as Pediatric Endocrinology remains unexplored. This study assesses the effectiveness of ChatGPT 4 in answering questions from the Pediatric Endocrine Self-Assessment Program (PESAP) and examines whether the model is suitable for creating an educational quiz for residents and fellows.

Methods: ChatGPT 4 was tested on questions from the 2021-2022 edition of PESAP using the prompt: “Can you assist from the perspective of a pediatric endocrinologist with the following patient case”. Responses were evaluated for initial correctness, and performance was analyzed across the seven “umbrella sections” of the tool (Adrenal, Bone, Carbohydrate and Lipid Metabolism/Obesity, Growth, Pituitary, Reproductive System, and Thyroid). We then customized the model by incorporating the PESAP questions and their detailed answer explanations, and used the customized model to generate a 10-question proof-of-concept quiz administered to four board-certified pediatric endocrinologists. The quiz included a scoring system designed to measure the breadth and depth of ChatGPT's knowledge.

Results: ChatGPT 4 correctly answered 52% of PESAP questions, with performance varying by category from 30% (Adrenal) to 78% (Reproductive System). For 16 questions, ChatGPT 4 did not provide an initial answer and required an explicit request for a response. For questions on thyroid cancer, explicit prompts were needed to elicit answers based on the American Thyroid Association 2015 guidelines. On the endocrinologist quiz, the average score was 80% (range 60%-100%).

Discussion: Based on our assessment, ChatGPT 4 should currently be used with caution as a final diagnostic tool, particularly in pediatric endocrinology. However, when provided with specific Pediatric Endocrinology case studies, ChatGPT 4 generated valuable educational questions that reinforce fundamental concepts in the field. We anticipate that as LLMs advance and receive direct medical training, out-of-the-box diagnostic accuracy will improve, enabling a transformative impact on medical education.
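As an illustration of the testing procedure described in Methods, below is a minimal sketch of how each PESAP case could be submitted programmatically. The abstract does not state whether the ChatGPT web interface or an API was used; the `openai` Python client, the `gpt-4` model identifier, and the substring-match scoring helper are all assumptions made for illustration, not the study's actual workflow.

```python
# Hypothetical sketch only: the study's prompt is real, but the API usage,
# model identifier, and scoring heuristic below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Verbatim prompt from the study's Methods section.
PROMPT = ("Can you assist from the perspective of a pediatric "
          "endocrinologist with the following patient case")

def ask_case(case_text: str) -> str:
    """Send one PESAP case with the study's prompt; return the model reply."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed identifier corresponding to ChatGPT 4
        messages=[{"role": "user", "content": f"{PROMPT}: {case_text}"}],
    )
    return response.choices[0].message.content

def score(cases: list[tuple[str, str]]) -> float:
    """Tally initial correctness over (case, expected_answer) pairs.
    Crude substring match, purely illustrative; the study graded
    responses manually against the PESAP answer key."""
    correct = sum(1 for case, answer in cases
                  if answer.lower() in ask_case(case).lower())
    return correct / len(cases)  # e.g. 0.52 would mirror the reported 52%
```

Under these assumptions, per-category accuracy (the seven “umbrella sections”) would simply come from running `score` on each section's subset of cases.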