Abstract

This paper introduces TrpViT, a novel triple attention vision transformer that efficiently captures both local and global features. The proposed architecture tackles global information acquisition by employing three complementary attention mechanisms in a unique attention block: Window, Dilated, and Channel attention. This attention block extracts spatially local features while expanding the receptive field to capture richer global context. By integrating this attention block with convolution, a new C-C-T-T architecture is formed. We rigorously evaluate TrpViT, demonstrating state-of-the-art performance on various computer vision tasks, including image classification, 2D and 3D object detection, instance segmentation, and low-level image colorization. Notably, TrpViT achieves strong accuracy across all parameter scales, highlighting its computational efficiency and effectiveness.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.