Abstract

Temporal gesture segmentation is an active research problem for many applications such as surgical skill assessment, surgery training, and robotic training. In this paper, we propose a novel method for gesture segmentation on untrimmed surgical videos of the challenging JIGSAWS dataset using a two-step methodology. We train and evaluate our method on 39 videos of the Suturing task, which contains 10 gestures. Gesture lengths range from 1 second to 75 seconds, and full video lengths vary from 1 minute to 5 minutes. In step one, we extract encoded frame-wise spatio-temporal features at the full temporal resolution of the untrimmed videos. In step two, we use these extracted features to identify gesture segments for temporal segmentation and classification. To extract high-quality features from the surgical videos, we also pre-train gesture classification models via transfer learning on the JIGSAWS dataset using two state-of-the-art pretrained backbone architectures. For segmentation, we propose an improved calibrated MS-TCN (CMS-TCN) that introduces a smoothed focal loss as the loss function, which regularizes the TCN and prevents over-confident predictions. We achieve a frame-wise accuracy of 89.8% and an Edit Distance score of 91.5%, an improvement of 2.2% over previous works. We also propose a novel evaluation metric that normalizes, in a single score, the effect of correctly classifying the frames of larger segments versus smaller segments.
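The abstract does not spell out the exact formulation of the smoothed focal loss. One plausible reading, sketched below as an illustration rather than the authors' implementation, is a focal loss evaluated against label-smoothed targets: label smoothing caps the confidence of the target distribution (helping calibration), while the focal term down-weights frames the model already classifies with high probability. The function name, and the `gamma` and `smoothing` values, are assumptions for the sketch, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def smoothed_focal_loss(logits, targets, num_classes, gamma=2.0, smoothing=0.1):
    """Illustrative smoothed focal loss: focal loss on label-smoothed targets.

    logits:  (N, C) raw per-frame class scores
    targets: (N,)   integer gesture labels
    gamma, smoothing: hypothetical hyperparameters, not from the paper
    """
    # Label smoothing: spread `smoothing` probability mass over non-target classes
    with torch.no_grad():
        soft_targets = torch.full_like(logits, smoothing / (num_classes - 1))
        soft_targets.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)

    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()

    # Focal modulation: down-weight frames that are already confidently correct
    focal_weight = (1.0 - probs) ** gamma
    loss = -(focal_weight * soft_targets * log_probs).sum(dim=1)
    return loss.mean()
```

In a multi-stage TCN such a term would typically be summed over the outputs of all refinement stages, but that wiring is likewise an assumption here.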
