Despite extensive studies of large language models and their ability to answer questions from various licensing exams, little attention has been paid to using chatbots for specific subjects within the medical curriculum, in particular medical neuroscience. This study compared the performance of Claude 3.5 Sonnet (Anthropic), GPT-3.5 and GPT-4-1106 (OpenAI), the free version of Copilot (Microsoft), and Gemini 1.5 Flash (Google) with that of students on multiple-choice questions (MCQs) from the medical neuroscience course database to evaluate the chatbots' reliability. Five successive attempts by each chatbot to answer 200 USMLE-style questions were evaluated for accuracy, relevance, and comprehensiveness. The MCQs were grouped into 12 categories (topics). The results indicated that, at the current level of development, the selected AI-driven chatbots can on average accurately answer 67.2% of MCQs from the medical neuroscience course, which is 7.4% below the students' average. However, Claude and GPT-4 outperformed the other chatbots, with 83% and 81.7% correct answers respectively, both better than the average student result. They were followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%). Across categories, Neurocytology, Embryology, and Diencephalon were the three best-answered topics, with average results of 78.1% to 86.7%, while Brainstem, Special senses, and Cerebellum were the weakest, with 54.4% to 57.7% correct answers. Our study suggests that Claude and GPT-4 are currently two of the most advanced chatbots, answering neuroscience MCQs more proficiently than the average medical student. This result marks a significant milestone in how AI can supplement and enhance educational tools and techniques.