Maintaining a high input resolution is crucial for more complex tasks like detection or segmentation to ensure that models can adequately identify and reflect fine details in the output. This study aims to reduce the computation costs associated with high-resolution input by using a variant of transformer, known as the Adaptive Clustering Transformer (ACT). The proposed model is named ACT-FRCNN. Which integrates ACT with a Faster Region-Based Convolution Neural Network (FRCNN) for a detection task head. In this paper, we proposed a method to improve the detection framework, resulting in better performance for out-of-domain images, improved object identification, and reduced dependence on non-maximum suppression. The ACT-FRCNN represents a significant step in the application of transformer models to challenging visual tasks like object detection, laying the foundation for future work using transformer models. The performance of ACT-FRCNN was evaluated on a variety of well-known datasets including BSDS500, NYUDv2, and COCO. The results indicate that ACT-FRCNN reduces over-detection errors and improves the detection of large objects. The findings from this research have practical implications for object detection and other computer vision tasks.
Read full abstract