Abstract

Since the success of Vision Transformers (ViTs), the computer vision community has shown growing interest in combining ConvNets and Transformers. Although such hybrid models have demonstrated state-of-the-art performance, many of them are too large and complex to deploy on edge devices for real-world applications. To address this challenge, we propose ECTFormer, an efficient hybrid network that leverages the strengths of ConvNets and Transformers while considering both model performance and inference speed. Specifically, our approach involves: (1) optimizing the combination of convolution kernels by dynamically adjusting kernel sizes based on the scale of the feature tensors; (2) revisiting the existing overlapping patchify scheme, both to reduce model size and to propagate fine-grained patches for better performance; and (3) introducing an efficient single-head self-attention mechanism in place of the multi-head self-attention of the base Transformer, which minimizes the increase in model size, boosts inference speed, and overcomes a bottleneck of ViTs. In experiments on ImageNet-1K, ECTFormer achieves comparable or higher top-1 accuracy than other efficient networks, along with faster inference on both GPUs and edge devices.
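
To make the third point concrete, the following is a minimal sketch of generic single-head scaled dot-product self-attention in NumPy. It is not the ECTFormer implementation: the abstract does not specify the layer's details, so the function name, the projection matrices `w_q`, `w_k`, `w_v`, and the toy dimensions are all assumptions chosen for illustration.

```python
import numpy as np


def single_head_self_attention(x, w_q, w_k, w_v):
    """Generic single-head scaled dot-product self-attention (illustrative only).

    x:             (num_tokens, dim) token features.
    w_q, w_k, w_v: (dim, head_dim) projection matrices (hypothetical names).
    Returns the attended features and the attention weights.
    """
    q = x @ w_q  # queries, (num_tokens, head_dim)
    k = x @ w_k  # keys,    (num_tokens, head_dim)
    v = x @ w_v  # values,  (num_tokens, head_dim)

    # One (num_tokens, num_tokens) score matrix, since there is a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights


# Toy sizes, assumed for the example; not taken from the paper.
rng = np.random.default_rng(0)
num_tokens, dim, head_dim = 4, 8, 8
x = rng.standard_normal((num_tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, head_dim)) for _ in range(3))

out, attn = single_head_self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

A multi-head layer would instead split the projections into several heads, compute one score matrix per head, and concatenate the per-head outputs; the single-head form above avoids that split and merge.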
