Object detection is a fundamental component of autonomous driving systems, and with the rise of transformers in recent years, many computer vision approaches have integrated transformers into object detectors to improve generalization. Building a pure transformer-based detector seems attractive; however, transformers are not a panacea and come with significant drawbacks. Their fundamental operator, multi-head self-attention (MHSA), has quadratic computational complexity, which leads to prohibitively high memory usage and low throughput. To address this issue, we use convolution operations to simulate MHSA, transferring the principles of attention onto convolutional neural networks (CNNs). This yields a detector that is both accurate and fast. Furthermore, a multi-scale pyramidal feature extractor gives the detector a better view of objects at various scales. Overall, our proposed detector follows the philosophy of the attention mechanism: a multi-scale feature-pyramid CNN encoder simulates the transformer, and a transformer-based query neck extracts all objects in a single pass and feeds them to the output heads. Trained on the COCO2017 dataset, by combining this construction philosophy with the characteristics of the transformer, our FPDT-Tiny reaches an average precision (AP) of up to 34.1 within only 150 training epochs, which is 16.0 and 10.8 higher than the CNN-based YOLOv3-Base and SSD-300, respectively. Under the same schedule, our FPDT-Small reaches an AP of up to 37.7, which is 10.4 and 7.9 higher than the transformer-based YOLOS-Small and DETR-ResNet-152, respectively, demonstrating competitive performance.
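The abstract's central idea is replacing quadratic-cost MHSA with convolutional mixing inside the encoder. The following is a minimal PyTorch sketch of that idea only, not the paper's actual FPDT architecture: the block name, kernel size, and channel counts are illustrative assumptions. A large-kernel depthwise convolution plays the role of attention's spatial mixing at linear cost, and 1x1 convolutions stand in for the transformer's feed-forward network.

```python
# Hypothetical sketch: a convolutional stand-in for a transformer block,
# assuming an encoder that operates on B x C x H x W feature maps.
import torch
import torch.nn as nn


class ConvAttentionBlock(nn.Module):
    """Convolutional block approximating an MHSA + MLP transformer block."""

    def __init__(self, channels: int, kernel_size: int = 7, mlp_ratio: int = 4):
        super().__init__()
        # Large-kernel depthwise conv gathers spatial context (attention-like mixing)
        # at cost linear in the number of spatial positions.
        self.spatial_mix = nn.Conv2d(channels, channels, kernel_size,
                                     padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        # 1x1 convs play the role of the transformer's feed-forward network.
        self.channel_mix = nn.Sequential(
            nn.Conv2d(channels, channels * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(channels * mlp_ratio, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.spatial_mix(self.norm(x))  # residual spatial mixing
        x = x + self.channel_mix(x)             # residual channel mixing
        return x


if __name__ == "__main__":
    block = ConvAttentionBlock(channels=64)
    feat = torch.randn(1, 64, 80, 80)           # one 80x80 feature map
    print(block(feat).shape)                    # torch.Size([1, 64, 80, 80])
```

In a multi-scale pyramidal encoder, blocks of this kind would be stacked at several feature-map resolutions before a query-based neck decodes object predictions, but the exact stage layout is not specified in the abstract.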