Abstract

Recognising workflow phases from endoscopic surgical videos is crucial for deriving indicators that convey the quality, efficiency, and outcome of a surgery, and for offering insights into surgical team skills. Additionally, workflow information is used to organise large surgical video libraries for training purposes. In this paper, we explore different deep networks that capture spatial and temporal information from surgical videos for surgical workflow recognition. The approach combines two networks: the first extracts features from video snippets, and the second performs action segmentation, identifying the different parts of the surgical workflow by analysing the extracted features. This work focuses on proposing, comparing, and analysing different design choices, including fully convolutional, fully transformer, and hybrid models that use transformers in conjunction with convolutions. We evaluate the methods on a large dataset of endoscopic surgical videos acquired during Gastric Bypass surgery. Both our proposed fully transformer method and our fully convolutional approach achieve state-of-the-art results. By integrating transformers and convolutions, our hybrid model achieves 93% frame-level accuracy and a segmental edit distance score of 85. This demonstrates the potential of hybrid models that employ both transformers and convolutions for accurate surgical workflow recognition.
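To make the two-stage design concrete, the sketch below shows one plausible way to wire a hybrid temporal model over pre-extracted snippet features: dilated temporal convolutions for local context followed by self-attention for global context, with a per-frame phase classifier. Class names, layer sizes, and the exact way convolutions and attention are combined are assumptions for illustration only; the paper's actual architectures may differ.

```python
# Hypothetical sketch of the two-stage pipeline: snippet features in,
# per-frame surgical phase logits out. All names and sizes are assumed.
import torch
import torch.nn as nn


class DilatedConvBlock(nn.Module):
    """Residual temporal convolution with a given dilation (TCN-style)."""
    def __init__(self, dim, dilation):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (batch, dim, time)
        return x + self.relu(self.conv(x))


class HybridSegmenter(nn.Module):
    """Hybrid action-segmentation head: dilated convolutions followed by
    self-attention, predicting a phase label for every time step."""
    def __init__(self, feat_dim=2048, hidden_dim=64, num_phases=8, num_layers=4):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, hidden_dim, kernel_size=1)
        self.conv_layers = nn.ModuleList(
            [DilatedConvBlock(hidden_dim, dilation=2 ** i) for i in range(num_layers)]
        )
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Conv1d(hidden_dim, num_phases, kernel_size=1)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        x = self.proj(feats.transpose(1, 2))   # (batch, hidden, time)
        for layer in self.conv_layers:         # local temporal context
            x = layer(x)
        x = x.transpose(1, 2)                  # (batch, time, hidden)
        x, _ = self.attn(x, x, x)              # global temporal context
        return self.classifier(x.transpose(1, 2))   # (batch, num_phases, time)


# Example: one video represented by 300 snippet features of dimension 2048.
logits = HybridSegmenter()(torch.randn(1, 300, 2048))
print(logits.shape)  # torch.Size([1, 8, 300])
```

In this sketch the snippet features would come from a separately trained feature-extraction network (the first network described above); only the temporal segmentation stage is shown.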
