Background
With advancements in natural language processing, tools such as Chat Generative Pre-trained Transformer (ChatGPT) version 4.0 and Google's Gemini Advanced are increasingly being evaluated for their potential in various medical applications. The objective of this study was to systematically assess the performance of these large language models (LLMs) on both image-based and non-image-based questions within the specialized field of Ophthalmology. We used a review question bank for the Ophthalmic Knowledge Assessment Program (OKAP), used nationally by ophthalmology residents to prepare for the Ophthalmology Board Exam, to assess the accuracy and performance of ChatGPT and Gemini Advanced.

Methodology
A total of 260 randomly generated multiple-choice questions from the OphthoQuestions question bank were run through ChatGPT-4.0 and Gemini Advanced. A simulated 260-question OKAP examination was created at random from the bank. Question-specific data were analyzed, including overall percent correct, subspecialty accuracy, whether the question was "high yield," difficulty (on a 1-4 scale), and question type (e.g., image, text). To compare the performance of ChatGPT-4.0 and Gemini Advanced across question difficulty, we used the standard deviation of user answer choices to determine question difficulty. Statistical analysis was conducted in Google Sheets using two-tailed t-tests with unequal variance to compare the performance of ChatGPT-4.0 and Gemini Advanced across question types, subspecialties, and difficulty levels.

Results
In total, 259 of the 260 questions were included in the study, as one question used a video that ChatGPT could not interpret as of May 1, 2024. For text-only questions, ChatGPT-4.0 correctly answered 57.14% (148/259, p < 0.018), and Gemini Advanced correctly answered 46.72% (121/259, p < 0.018). Both models answered most questions without a secondary prompt and would have received a below-average score on the OKAP.
Moreover, 27 questions required a secondary prompt in ChatGPT-4.0, compared with 67 questions in Gemini Advanced. ChatGPT-4.0 correctly answered 68.99% of easier questions (difficulty <2 on the 1-4 scale) and 44.96% of harder questions (difficulty >2), while Gemini Advanced correctly answered 49.61% of easier questions and 44.19% of harder questions. There was a statistically significant difference in accuracy between ChatGPT-4.0 and Gemini Advanced for easy questions (p < 0.0015) but not for hard questions (p < 0.55). For image-only questions, ChatGPT-4.0 correctly answered 39.58% (19/48, p < 0.013), and Gemini Advanced correctly answered 33.33% (16/48, p < 0.022); the difference in accuracy between the two models was not statistically significant (p < 0.530). Within each model, a comparison between text-only and image-based questions demonstrated a statistically significant difference in accuracy for both ChatGPT-4.0 (p < 0.013) and Gemini Advanced (p < 0.022).

Conclusions
This study provides evidence that ChatGPT-4.0 outperforms Gemini Advanced on OKAP-style ophthalmic multiple-choice questions. This suggests a potential role for ChatGPT in ophthalmic medical education. While these models show promise within medical education, caution is warranted, as a more detailed evaluation of their reliability is needed.
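The two-tailed, unequal-variance comparison named in the Methodology corresponds to Welch's t-test on per-question correctness. A minimal sketch is shown below; the 0/1 vectors are reconstructed from the reported text-question counts (148/259 vs. 121/259), not from the study's raw item-level data, and the exact procedure used in Google Sheets may differ:

```python
# Sketch of a Welch's (unequal-variance) two-tailed t-test comparing
# per-question correctness between two models. The 0/1 vectors below are
# reconstructed from the reported counts, not the study's raw data.
from scipy.stats import ttest_ind

chatgpt = [1] * 148 + [0] * (259 - 148)  # 1 = answered correctly
gemini = [1] * 121 + [0] * (259 - 121)

# equal_var=False selects Welch's test (unequal variances)
t_stat, p_value = ttest_ind(chatgpt, gemini, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

On these reconstructed vectors, the resulting p-value falls in the same range as the overall text-question comparison reported above (p < 0.018).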