Surgical skill assessment currently relies on manual observation by senior surgeons, a process that is inherently time-consuming and subjective. There is therefore a need for machine learning-based automated robotic surgical skill assessment. However, existing machine learning approaches operate in either the time domain or the frequency domain alone and have not investigated the time–frequency domain. To fill this research gap, we explore representing surgical motion data in the time–frequency domain. In this study, we propose a novel automated robotic surgical skill assessment framework called Continuous Wavelet Transform-Vision Transformer (CWT-ViT). We apply the continuous wavelet transform, a time–frequency representation method, to convert robotic surgery kinematic data into synthetic images. Furthermore, exploiting prior knowledge of the da Vinci surgical system, we design a four-branch architecture in which each branch represents one robotic manipulator. We have conducted extensive experiments and achieved results comparable to existing methods on the benchmark robotic surgical skill dataset, the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS). Our proposed CWT-ViT framework demonstrates the feasibility of applying time–frequency representations to automated robotic surgical skill assessment using kinematic data. The code is available at https://github.com/yiming95/CWT-ViT-Surgery.
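The core preprocessing step the abstract describes, converting a 1-D kinematic signal into a 2-D time–frequency image via the continuous wavelet transform, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Morlet mother wavelet, the scale range, and the synthetic "kinematic" signal are all assumptions chosen for demonstration.

```python
import numpy as np

def morlet(t, w=5.0):
    # Complex Morlet mother wavelet with center frequency w (illustrative choice)
    return np.pi ** -0.25 * np.exp(1j * w * t) * np.exp(-t ** 2 / 2)

def cwt_scalogram(signal, scales, w=5.0):
    # Naive CWT by direct convolution: one row per scale, one column per sample.
    # Returns the magnitude scalogram, i.e., the 2-D "image" fed to a vision model.
    n = len(signal)
    out = np.empty((len(scales), n))
    t = np.arange(-n // 2, n // 2)
    for i, s in enumerate(scales):
        wavelet = morlet(t / s, w) / np.sqrt(s)          # scaled, normalized wavelet
        out[i] = np.abs(np.convolve(signal, np.conj(wavelet)[::-1], mode="same"))
    return out

# Hypothetical single kinematic channel: slow tool motion plus a faster tremor,
# sampled at 30 Hz for 4 seconds (parameters are illustrative, not from JIGSAWS)
fs = 30
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.sin(2 * np.pi * 5 * t)

scalogram = cwt_scalogram(x, scales=np.arange(1, 33))    # shape (32, 120)
```

In a pipeline like the one described, each of the four manipulators' channels would be converted to such a scalogram image and routed to its own branch of the transformer; in practice a library routine (e.g., PyWavelets' `pywt.cwt`) would replace this hand-rolled version.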