Abstract

Local and sparse attention effectively reduce the high computational cost of global self-attention. However, local attention lacks global dependencies, while sparse attention captures only coarse features. Although some subsequent models employ effective interaction techniques to improve classification performance, we observe that their computation and memory grow rapidly as the resolution increases. Consequently, applying them to downstream tasks with large resolutions is slow and resource-intensive. In response to this concern, we propose an effective backbone network based on a novel attention mechanism with linear computational complexity, called Concatenating glObal tokens in Local Attention (COLA). COLA is straightforward to implement: it incorporates global information into local attention by concatenation. To capture high-quality global information, we introduce a learnable condensing feature (LCF) module. LCF has two properties: (1) it performs a function similar to clustering, aggregating image patches into a smaller number of tokens based on similarity; (2) it produces a constant number of aggregated tokens regardless of the image size, which ensures that it is a linear-complexity operator. Based on COLA, we build COLAFormer, which achieves global dependency and fine-grained feature capturing with linear computational complexity and demonstrates impressive performance across various vision tasks. Specifically, our COLAFormer-S achieves 84.5% classification accuracy, surpassing other advanced models by 0.4% with similar or lower resource consumption. Furthermore, COLAFormer-S achieves better object detection performance while consuming only 1/4 of the resources of other state-of-the-art models. The code and models will be made publicly available.
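As a rough illustration of the linear-complexity claim, the sketch below counts the query-key pairs scored by global self-attention and by a COLA-style layer in which each query attends to its local window plus a fixed set of condensed global tokens. The function names, the 7x7 window, and the 49 global tokens are illustrative assumptions, not values taken from the paper. The counts ignore constant factors such as the head dimension, and the cost model for the condensing step is likewise an assumption.

```python
def global_attention_pairs(n_tokens: int) -> int:
    """Pairs scored by full self-attention: every token attends to every token."""
    return n_tokens * n_tokens


def condensing_pairs(n_tokens: int, n_global: int) -> int:
    """Assumed cost of the condensing step: every patch is compared
    with each of a fixed number of condensed (global) tokens."""
    return n_tokens * n_global


def cola_style_pairs(n_tokens: int, window: int, n_global: int) -> int:
    """Pairs scored when each query sees the keys of its own local window
    with the global tokens concatenated to them."""
    return n_tokens * (window + n_global)


if __name__ == "__main__":
    WINDOW = 7 * 7   # hypothetical local window size, in tokens
    N_GLOBAL = 49    # hypothetical constant number of global tokens
    for side in (14, 28, 56):  # feature-map side lengths, chosen for illustration
        n = side * side
        total = cola_style_pairs(n, WINDOW, N_GLOBAL) + condensing_pairs(n, N_GLOBAL)
        print(f"tokens={n} global={global_attention_pairs(n)} cola_style={total}")
```

Under these assumptions, doubling the number of tokens doubles the COLA-style count but quadruples the global one, because the window size and the number of global tokens do not depend on the image size.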
