Anatomical site recognition is a basic skill for gastroenterologists, yet there is no unified framework for identifying anatomical sites in both conventional and capsule endoscopy. Deep learning (DL), and in particular the vision transformer (ViT), is promising in medical imaging, but the performance of these architectures has not been comprehensively compared. This retrospective cohort study included 322 patients who underwent capsule endoscopy at Friendship Hospital and 556 patients who underwent conventional endoscopy at Minhang Hospital. A convolutional neural network (CNN) and two ViT variants (B/16 and L/32) were first trained to separate qualified from low-quality images (the first model); the qualified images were then used to train a second model to distinguish anatomical sites. In total, 62,850 capsule endoscopy images and 17,434 conventional endoscopy images were used for model development. In internal cross-validation, the CNN achieved an average area under the receiver operating characteristic curve (AUROC) of 0.9844 (95% confidence interval [CI] 0.9640–0.9960) for distinguishing qualified from low-quality images and an average accuracy of 0.9251 (95% CI 0.9133–0.9369) for distinguishing anatomical sites; the ViTs did not surpass the CNN. For prospective validation, 18,636 images from 355 patients who underwent capsule endoscopy and 15,949 images from 501 patients who underwent conventional endoscopy were collected. On these data, the CNN reached an AUROC of 0.8715 (95% CI 0.8674–0.8754) for the first model and an accuracy of 0.8376 (95% CI 0.8336–0.8414) for the second. With the same hyperparameter settings, the CNN outperformed the ViTs at filtering out unqualified images and distinguishing anatomical sites.
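The two-stage design described above (a quality filter followed by an anatomical-site classifier, evaluated with AUROC) can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the model callables are hypothetical placeholders standing in for the trained CNN/ViT, and the AUROC is computed with the rank-based (Mann–Whitney) identity rather than a specific library.

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs where the positive scores higher
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def two_stage_predict(frames, quality_model, site_model, threshold=0.5):
    """Stage 1: quality_model scores each frame as 'qualified';
    frames below the threshold are flagged low-quality.
    Stage 2: site_model labels the anatomical site of qualified frames.
    Both models are placeholders for the trained networks."""
    results = []
    for frame in frames:
        q = quality_model(frame)  # probability the frame is qualified
        if q < threshold:
            results.append(("low_quality", q))
        else:
            results.append((site_model(frame), q))
    return results
```

For example, with stub models `lambda f: f["q"]` and `lambda f: f["site"]`, a frame scored 0.9 passes to the site classifier while a frame scored 0.2 is filtered out, mirroring how only qualified images reach the second model in the study.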