This study addresses challenges in surgical education, particularly in neuroendoscopy, where the demand for an optimized operating-room workflow conflicts with trainees' need for active participation in surgeries. To overcome these challenges, we propose a framework that accurately identifies anatomical structures within images guided by language descriptions, facilitating authentic and interactive learning experiences in neuroendoscopy. Built on the encoder-decoder architecture of a conventional transformer, the framework processes multimodal inputs (images and language descriptions) to identify and localize features in neuroendoscopic images. We curate a dataset from recorded endoscopic third ventriculostomy (ETV) procedures for training and evaluation. Using evaluation metrics including R@n, IoU ≥ θ, mIoU, and top-1 accuracy, we systematically benchmark the framework against state-of-the-art methods. The framework generalizes well, surpassing the compared methods with 93.67% accuracy and 76.08% mIoU on unseen data, while also offering faster inference than the other methods. Qualitative results affirm its effectiveness in precisely localizing referred anatomical features within neuroendoscopic images. This ability to localize anatomical features from language descriptions positions the framework as a valuable component of future interactive clinical learning systems, and its strong performance underscores its potential to enhance surgical training, skills, and outcomes for trainees in neuroendoscopy.
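To clarify the reported benchmarks, the minimal sketch below shows one common way box-level grounding metrics such as mIoU and accuracy at an IoU threshold are computed. The function names, the (x1, y1, x2, y2) box convention, and the θ = 0.5 threshold are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(pred_boxes, gt_boxes, theta=0.5):
    """mIoU and fraction of predictions with IoU >= theta over a test set.

    theta = 0.5 is a common convention in visual grounding; the paper's
    exact thresholds may differ.
    """
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return {"mIoU": float(ious.mean()),
            f"Acc@IoU>={theta}": float((ious >= theta).mean())}
```

Under this convention, the reported 76.08% mIoU would be the mean IoU across all test images, and accuracy the share of predictions clearing the threshold.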