Abstract

In this paper, we propose a novel method for Unsupervised Disentanglement of Scene and Motion (UDSM) representations for minimally invasive surgery video retrieval within large databases, which has the potential to advance intelligent and efficient surgical teaching systems. To extract more discriminative video representations, two dedicated encoders, trained with a triplet ranking loss and an adversarial learning mechanism, respectively capture the spatial and temporal information of each frame, yielding disentangled features with promising interpretability. In addition, long-range temporal dependencies are modeled at the integrated video level by a temporal aggregation module, which then produces a set of compact binary codes carrying representative features to enable fast retrieval. The entire framework is trained in an unsupervised scheme, i.e., it learns purely from raw surgical videos without any annotations. We construct two large-scale minimally invasive surgery video datasets, one based on the public Cholec80 dataset and one on our in-house dataset of laparoscopic hysterectomy, to train the model and to validate the effectiveness of our proposed method qualitatively and quantitatively on the surgical video retrieval task. Extensive experiments show that our approach significantly outperforms state-of-the-art video retrieval methods on both datasets, revealing a promising future for injecting intelligence into the next generation of surgical teaching systems.
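
The abstract describes a pipeline of scene/motion encoders with triplet and adversarial objectives, a temporal aggregation module, and binary codes for retrieval. The sketch below is a minimal, assumption-based illustration of how such a pipeline could be wired up in PyTorch; the backbone architectures, feature dimensions, inputs (e.g., optical flow for motion), and loss weights are placeholders, not the authors' actual UDSM implementation.

```python
# Illustrative sketch only: encoders, dimensions, and losses are assumptions,
# not the paper's specification.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Tiny CNN standing in for the scene or motion encoder."""
    def __init__(self, in_ch: int, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

class UDSMSketch(nn.Module):
    def __init__(self, dim: int = 128, code_bits: int = 64):
        super().__init__()
        self.scene_enc = FrameEncoder(in_ch=3, dim=dim)   # RGB frame -> scene feature
        self.motion_enc = FrameEncoder(in_ch=2, dim=dim)  # optical flow -> motion feature
        # Discriminator tries to recover scene content from motion features;
        # the motion encoder is trained adversarially to defeat it (disentanglement).
        self.discriminator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        # Video-level temporal aggregation over concatenated per-frame features.
        self.aggregate = nn.GRU(2 * dim, dim, batch_first=True)
        self.hash_head = nn.Linear(dim, code_bits)  # compact codes for fast retrieval

    def forward(self, frames, flows):
        # frames: (B, T, 3, H, W), flows: (B, T, 2, H, W)
        B, T = frames.shape[:2]
        s = self.scene_enc(frames.flatten(0, 1)).view(B, T, -1)
        m = self.motion_enc(flows.flatten(0, 1)).view(B, T, -1)
        _, h = self.aggregate(torch.cat([s, m], dim=-1))
        # Relaxed codes during training; binarize with sign() at retrieval time.
        codes = torch.tanh(self.hash_head(h[-1]))
        return s, m, codes

# Example objectives (unsupervised): a triplet ranking loss on video codes
# (anchor/positive/negative clips mined from raw videos) plus an adversarial
# loss pushing motion features to carry no scene information.
triplet_loss = nn.TripletMarginLoss(margin=0.5)
adv_loss = nn.BCEWithLogitsLoss()
```

In this sketch, retrieval would compare the sign-binarized codes with Hamming distance, which is what makes search over a large surgical video database fast; how the actual method forms triplets and balances the losses is not specified in the abstract.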
