Abstract

Since the advent of Transformers and, subsequently, Vision Transformers (ViTs), researchers have achieved enormous success in computer vision and object detection. However, the mechanism of splitting images into fixed-size patches poses a serious challenge in this arena, as it can discard useful information during object detection and classification. To overcome these challenges, we propose an innovative intelligent patching mechanism and integrate it seamlessly into the conventional patch-based ViT framework. The proposed method uses patches of flexible sizes to capture and retain essential semantic content from input images, thereby improving performance over conventional methods. We evaluated our method on object detection and classification using three renowned datasets: Microsoft Common Objects in Context (MS COCO 2017), the Pascal Visual Object Classes (VOC) Challenge, and Cityscapes. The experimental results show promising improvements on specific metrics, particularly at higher confidence thresholds, making the method a notable performer in object detection and classification tasks.
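The paper itself is not available here, but the contrast between fixed and flexible patching can be illustrated with a minimal sketch. The snippet below is not the authors' method: it shows standard fixed-size ViT patching next to a hypothetical quadtree-style adaptive scheme that assigns smaller patches to high-variance (detail-rich) regions, as one plausible reading of "patches with flexible sizes". All function names, sizes, and the variance threshold are illustrative assumptions.

```python
import numpy as np

def fixed_patches(img, p):
    """Standard ViT patching: split an H x W image into non-overlapping p x p patches."""
    h, w = img.shape
    return [img[i:i + p, j:j + p] for i in range(0, h, p) for j in range(0, w, p)]

def adaptive_patches(img, min_size=4, max_size=16, var_thresh=0.01):
    """Hypothetical flexible patching: quadtree split driven by local pixel variance.

    Flat regions are kept as large patches; detail-rich regions are recursively
    subdivided, so more tokens are spent where the semantic content is.
    Returns (row, col, size) tuples describing each square patch.
    """
    h, w = img.shape
    patches = []

    def split(y, x, s):
        block = img[y:y + s, x:x + s]
        if s <= min_size or block.var() <= var_thresh:
            patches.append((y, x, s))  # stop: small enough or low detail
        else:
            half = s // 2
            for dy in (0, half):       # recurse into the four quadrants
                for dx in (0, half):
                    split(y + dy, x + dx, half)

    for y in range(0, h, max_size):
        for x in range(0, w, max_size):
            split(y, x, max_size)
    return patches
```

On a toy 32 x 32 image with one noisy quadrant, the adaptive scheme covers the flat area with a few large patches and the noisy area with many 4 x 4 ones, while still tiling every pixel exactly once.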
