Category: Other; Ankle

Introduction/Purpose: Artificial intelligence chatbots have seen a notable rise in recent years, especially with the creation of ChatGPT, a chatbot built on a large language model capable of carrying on human-like conversation. While ChatGPT has no specific medicine-related training, prior studies have shown that its newest version, GPT-4, can pass professional licensing examinations and perform comparably to surgical residents on question bank sets. The purpose of this study was to explore the diagnostic and decision-making capacities of ChatGPT-4 in clinical management, specifically assessing accuracy in the identification and treatment of foot and ankle pathologies.

Methods: This study presented 16 foot and ankle cases to ChatGPT-4. Each case was evaluated by 3 fellowship-trained foot and ankle orthopaedic surgeons. The scoring system comprised 5 criteria, each rated on a 5-point Likert scale, yielding sum scores ranging from 5 (lowest) to 25 (highest). The criteria were: stating the correct diagnosis, stating the most appropriate procedure, identifying alternative treatments, providing comprehensive information beyond treatment, and not mentioning nonexistent therapies. ChatGPT-4 was referred to as "Dr. GPT," using role prompting to encourage step-by-step processing and establish a peer dynamic so that the chatbot emulated the role of an orthopaedic surgeon. The 16 cases presented were: plantar fasciitis, Morton neuroma, ankle sprain, Achilles tendon rupture, Achilles tendonitis, metatarsalgia, peroneal tendon tear, posterior tibial tendon insufficiency, distal fibula fracture, 2nd metatarsal stress fracture, 5th metatarsal fracture, ankle arthritis, hallux rigidus, Lisfranc injury, midfoot arthritis, and hallux valgus.

Results: The average score across all criteria for all 16 cases was 4.47, with an average sum score of 22.4.
The plantar fasciitis case received the highest score, with an average sum score of 24.7. The lowest score was observed in the peroneal tendon tear case, with an average sum score of 16.3. Subgroup analyses of each of the 5 criteria using Friedman rank-sum tests showed no statistically significant differences in surgeon grading. Criterion 5, lack of mention of nonexistent treatment options, and criterion 1, the ability of ChatGPT-4 to correctly diagnose, received the highest subgroup scores of 4.88 and 4.77, respectively. The lowest score was observed for criterion 4 (4.05), which evaluated whether ChatGPT-4 provided comprehensive information beyond treatment options.

Conclusion: This study demonstrates that ChatGPT-4 effectively diagnosed and provided reliable treatment recommendations for most of the foot and ankle cases presented, with consistent grading among surgeon evaluators. The individual criterion assessment revealed that ChatGPT-4 was most effective in diagnosing pathologies. Additionally, the chatbot consistently did not suggest nonexistent treatment options, a common finding in prior studies evaluating ChatGPT-3.5, in which fabricated information was presented as if it were true. This resource could be useful for clinicians seeking patient education materials on diagnoses and treatment options without fear of incorrect information being presented, though comprehensive information beyond treatment may be limited.
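For readers unfamiliar with the statistical approach, the Friedman rank-sum test mentioned in the Results compares related samples (here, the three surgeons' scores for the same 16 cases) by ranking within each case. The sketch below is a minimal, simplified implementation with made-up illustrative scores, not the study's data, and it omits the tie-correction factor that full statistical packages apply.

```python
def friedman_statistic(*samples):
    """Friedman chi-square statistic for k related samples
    (average ranks for ties; no tie-correction factor)."""
    k = len(samples)      # number of graders
    n = len(samples[0])   # number of cases rated by each grader
    rank_sums = [0.0] * k
    for case in range(n):
        vals = [s[case] for s in samples]
        order = sorted(range(k), key=lambda j: vals[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            # find the run of tied values and assign their average rank
            j = i
            while j + 1 < k and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average rank
            for m in range(i, j + 1):
                ranks[order[m]] = avg_rank
            i = j + 1
        for grader in range(k):
            rank_sums[grader] += ranks[grader]
    # standard Friedman chi-square formula
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)


# Hypothetical sum scores (5-25) from three surgeons rating the same cases.
surgeon_a = [24, 22, 23, 21, 25, 20, 17, 22]
surgeon_b = [25, 21, 23, 22, 24, 21, 16, 23]
surgeon_c = [24, 22, 22, 21, 25, 20, 16, 22]
stat = friedman_statistic(surgeon_a, surgeon_b, surgeon_c)
```

A small chi-square statistic (compared against a chi-square distribution with k-1 degrees of freedom) corresponds to the study's finding of no significant difference between graders.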