Peripheral vision is a vital component of human visual processing that allows efficient and accurate recognition of visual features across diverse regions of the visual field. Endoscopic images are analogous in that their peripheral regions are often blurred, owing to their inherent imaging properties. Previous strategies that use either coarse-grained global attention or fine-grained local attention to improve performance have often inadvertently compromised the intrinsic self-attention mechanism of multilayer transformers, leading to suboptimal solutions. This research introduces Self-Peripheral-Attention (SPA), a mechanism that incorporates peripheral vision modeling into self-attention to improve the accuracy and efficiency of classification and segmentation tasks in endoscopic imaging. SPA combines fine-grained central and coarse-grained peripheral interactions and has three primary characteristics: (i) peripheral contextualization aggregation; (ii) interaction between coarse-grained peripheral and fine-grained central features, facilitated by depthwise dilated convolution; (iii) an element-wise affine transformation that integrates attention into the value. The effectiveness and generalizability of the proposed SPA-Net were assessed on the XJUEE, XJUEE-SEG, Kvasir and Kvasir-SEG endoscopy datasets. The results underscore the potential of peripheral vision modeling in self-attention for improving machine perception models. The associated code can be accessed at https://github.com/huoxiangzuo/SPA.
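To make the three characteristics concrete, the following is a minimal one-dimensional sketch in plain Python. It is an illustration under stated assumptions, not the SPA implementation from the linked repository: the function names (`peripheral_context`, `depthwise_dilated_conv`, `spa_like_block`), the additive mixing of central and peripheral features, the window radius, and the exact form of the affine step are all hypothetical choices made here for clarity.

```python
# Illustrative 1-D sketch only; NOT the authors' SPA code.
# Inputs are lists of tokens, each token a list of per-channel floats: x[t][c].
# Every name, shape, and constant below is an assumption for illustration.

def peripheral_context(x, radius):
    """(i) Coarse-grained context: per-channel mean over a window around each token."""
    T, C = len(x), len(x[0])
    out = []
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)  # clip at the borders
        out.append([sum(x[s][c] for s in range(lo, hi)) / (hi - lo) for c in range(C)])
    return out

def depthwise_dilated_conv(x, kernels, dilation):
    """(ii) Depthwise (one kernel per channel) dilated convolution with zero padding."""
    T, C = len(x), len(x[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            half = len(kernels[c]) // 2
            for k, w in enumerate(kernels[c]):
                s = t + (k - half) * dilation  # dilated tap position
                if 0 <= s < T:
                    out[t][c] += w * x[s][c]
    return out

def spa_like_block(x, value, kernels, gamma, beta, radius=2, dilation=2):
    """Combine central and peripheral features, then modulate the value.

    Assumed form of step (iii): out = value * (gamma * attn + beta), element-wise.
    """
    periph = peripheral_context(x, radius)
    # Assumed interaction: add central (x) and peripheral features before the conv.
    mixed = [[a + b for a, b in zip(xr, pr)] for xr, pr in zip(x, periph)]
    attn = depthwise_dilated_conv(mixed, kernels, dilation)
    return [
        [v * (gamma[c] * a + beta[c]) for c, (v, a) in enumerate(zip(vr, ar))]
        for vr, ar in zip(value, attn)
    ]

if __name__ == "__main__":
    tokens = [[1.0, 1.0] for _ in range(5)]   # 5 tokens, 2 channels
    kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
    print(spa_like_block(tokens, tokens, kernels, gamma=[1.0, 0.5], beta=[0.0, 1.0]))
```

In this sketch the dilation widens the receptive field of each channel without adding weights, which is the usual motivation for pairing a dilated depthwise convolution with coarse peripheral context; whether SPA mixes the two streams by addition, as assumed here, should be checked against the repository.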