Abstract

Recognizing surgical activities in endoscopic videos is of vital importance for developing context-aware decision support in the operating room. In this work, we model each surgical activity as an action triplet consisting of the surgical instrument, the action, and the target organ that the instrument interacts with. The goal is to recognize these action triplets from endoscopic videos. Correctly recognizing fine-grained activity triplets is challenging, however, because of the long-tail distribution of the triplet classes and the complex associations both between triplets and within each triplet. In addition, multiple triplets may appear in a given video frame. To address these challenges, we propose a new model for surgical action triplet recognition based on a classification forest and a Graph Convolutional Network (GCN), which we call Forest GCN. The classification forest calibrates the fine-grained triplet classifiers with their upstream parent classifiers, suppressing noisy logits of the triplet classes in the long tail, while stacked GCNs model the dependencies between triplet classes and leverage language embeddings. Experiments on the endoscopic video dataset CholecT50 demonstrate that our proposed method outperforms current state-of-the-art methods on surgical action triplet recognition.
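
To make the two components concrete, below is a minimal PyTorch sketch of (a) stacked GCN layers that propagate language embeddings of the triplet classes over a class-dependency graph to produce per-class weight vectors, and (b) a parent-to-triplet calibration step that down-weights long-tail triplet logits using parent-classifier probabilities. The class names, the graph construction, and the log-probability calibration rule here are illustrative assumptions, not the authors' exact Forest GCN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        # a_hat: normalized (C x C) adjacency encoding triplet-class dependencies
        return F.relu(self.linear(a_hat @ h))

class TripletHead(nn.Module):
    """Illustrative head: stacked GCNs over language embeddings yield
    per-class classifier weights; parent (instrument/verb/target)
    probabilities then gate the fine-grained triplet logits."""
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.gcn1 = GCNLayer(embed_dim, embed_dim)
        self.gcn2 = GCNLayer(embed_dim, feat_dim)

    def forward(self, frame_feat, class_embed, a_hat, parent_prob):
        # frame_feat:  (B, D) visual features of a video frame
        # class_embed: (C, E) language embeddings of the triplet classes
        # parent_prob: (B, C) parent-classifier probability assigned to each
        #              triplet (an assumed calibration signal, e.g. the product
        #              of its instrument, verb, and target probabilities)
        w = self.gcn2(self.gcn1(class_embed, a_hat), a_hat)   # (C, D)
        logits = frame_feat @ w.t()                           # (B, C)
        # Calibration: add parent log-probabilities so triplets whose parents
        # are unlikely (common in the long tail) get suppressed logits.
        return logits + torch.log(parent_prob + 1e-6)

# Example usage with random tensors (shapes only; not real data):
head = TripletHead(feat_dim=512, embed_dim=300)
C = 100  # CholecT50 defines 100 triplet classes
a_hat = torch.eye(C)  # placeholder normalized adjacency
logits = head(torch.randn(4, 512), torch.randn(C, 300), a_hat, torch.rand(4, C))
```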
