Abstract

Automatic surgical instrument segmentation is a necessary step for the reliable operation of surgical robots, and its accuracy directly affects the surgical outcome. Nevertheless, accurately segmenting surgical instruments from endoscopic images remains challenging because of the complex environment and instrument motion during surgery. Based on the encoder–decoder structure, a transformer-based multiscale fusion network, named TMF-Net, is proposed to address these difficulties. To obtain effective feature representations and combine the complementary strengths of the pretrained ResNet34 and the transformer, a dual-encoder unit is proposed that simultaneously learns the semantic relationships between adjacent and distant pixels and comprehensively captures global context information. Meanwhile, to retain more contextual information, a trapezoid atrous spatial pyramid pooling (trapezoid ASPP) block is proposed to enhance local features with different receptive fields and enrich the feature representation. Furthermore, to handle the multiscale nature of surgical instruments in endoscopic images, a multiscale attention fusion (MAF) block is proposed to fuse multiscale feature maps and direct the network's attention to the most informative channels, thereby improving segmentation accuracy. Two typical datasets, Kvasir-Instrument and Endovis2017, are used for performance analysis and verification. Experimental results indicate that the proposed TMF-Net effectively improves the segmentation accuracy of surgical instruments and yields competitive results in comparison with advanced segmentation methods.
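The abstract names three architectural ideas: a dual ResNet34/transformer encoder, a trapezoid ASPP block, and a multiscale attention fusion (MAF) block. The PyTorch sketch below illustrates one plausible realisation of each idea under stated assumptions; the dilation rates, the squeeze-and-excitation style channel gate, the additive fusion, and the transformer depth and head count are all illustrative choices, not the paper's implementation.

# Minimal, illustrative sketch of the three ideas named in the abstract.
# Block internals (dilation rates, channel sizes, fusion strategy) are
# assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
from torchvision.models import resnet34


class TrapezoidASPP(nn.Module):
    """Parallel atrous convolutions with different receptive fields
    (the dilation rates here are assumed), concatenated and projected."""
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))


class MultiscaleAttentionFusion(nn.Module):
    """Fuses two feature maps and re-weights channels with a
    squeeze-and-excitation style gate (an assumed reading of 'MAF')."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # Upsample the coarser map to the finer map's resolution, add, gate.
        high = nn.functional.interpolate(high, size=low.shape[-2:],
                                         mode="bilinear", align_corners=False)
        fused = low + high
        return fused * self.gate(fused)


class DualEncoder(nn.Module):
    """ResNet34 features refined by a small transformer encoder to capture
    long-range context (depth and head count are assumptions)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # weights=None keeps the sketch offline; the paper uses a
        # pretrained ResNet34 backbone.
        backbone = resnet34(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # 512 ch
        self.proj = nn.Conv2d(512, embed_dim, 1)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        f = self.proj(self.cnn(x))                 # B x C x H x W
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)      # B x (H*W) x C
        tokens = self.transformer(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    enc = DualEncoder()
    aspp = TrapezoidASPP(256, 256)
    maf = MultiscaleAttentionFusion(256)
    x = torch.randn(1, 3, 224, 224)
    feats = aspp(enc(x))
    print(maf(feats, feats).shape)   # torch.Size([1, 256, 7, 7])

In a full encoder–decoder network of this kind, the MAF block would typically fuse decoder features with skip connections from several encoder stages rather than a feature map with itself; the self-fusion above is only to keep the sketch self-contained.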
