Accurate identification of surgical instruments is crucial for efficient workflows and patient safety in the operating room, particularly in preventing complications such as retained surgical instruments. Artificial Intelligence (AI) models have shown potential to automate this process. This study evaluates the accuracy of publicly available Large Language Models (LLMs), namely ChatGPT-4, ChatGPT-4o, and Gemini, and of a specialized commercial mobile application, Surgical-Instrument Directory (SID 2.0), in identifying surgical instruments from images. The study used a dataset of 92 high-resolution images of 25 surgical instruments (retractors, forceps, scissors, and trocars) photographed from multiple angles. Model performance was evaluated using accuracy, weighted precision, recall, and F1 score. ChatGPT-4o exhibited the highest accuracy (89.1%) in categorizing instruments (e.g., scissors, forceps). SID 2.0 (77.2%) and ChatGPT-4 (76.1%) achieved comparable accuracy, while Gemini (44.6%) performed worse on this task. For precise identification of instrument subtypes (e.g., “Mayo scissors” or “Kelly forceps”), all models had low accuracy: SID 2.0 scored highest at 39.1%, followed by ChatGPT-4o at 33.7%. Subgroup analysis revealed that ChatGPT-4 and ChatGPT-4o recognized trocars in all instances; similarly, Gemini identified surgical scissors in all instances. In conclusion, publicly available LLMs can reliably identify surgical instruments at the category level, with ChatGPT-4o demonstrating an overall edge. However, precise subtype identification remains a challenge for all models. These findings highlight the potential of AI-driven solutions to enhance surgical-instrument management and underscore the need for further refinements to improve accuracy and support patient safety.
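The abstract reports accuracy alongside weighted precision, recall, and F1. As a minimal sketch of how such metrics could be computed, assuming a standard scikit-learn evaluation pipeline (the study does not specify its tooling), the snippet below uses hypothetical category labels and predictions, not the study's data:

```python
# Minimal sketch of the reported metrics (accuracy, weighted precision,
# recall, F1), assuming scikit-learn; labels are hypothetical examples.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical category-level ground truth vs. model predictions
y_true = ["scissors", "forceps", "retractor", "trocar", "forceps"]
y_pred = ["scissors", "forceps", "retractor", "trocar", "scissors"]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" weights each class's score by its support,
# matching the "weighted" metrics named in the abstract.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```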