Abstract

Query-based object detection models build on the transformer. By using learnable queries, they complete the transition from traditional dense detection to sparse detection. However, such models suffer from slow convergence, poor adaptability to changes in the detection targets, and a reliance on additional networks. The CycMixer proposed in this paper uses multi-scale granule cluster sampling for encoding, avoiding the large number of parameters that an explicit structure would introduce. It adopts a cycle mixing module that, by means of adaptive channel sampling and cycle spatial sampling, improves adaptability and enlarges the receptive field to cope with changes in the detection targets. First, the query extracted by the backbone is decoupled into a content vector and a positional vector, and a multi-head attention mechanism generates offsets. Second, the multi-scale feature map obtained by the backbone and the transformed position enter the multi-scale granule cluster sampling stage, which consists of multi-scale feature space generation and granule cluster sampling. In addition, the offsets and the original positional vector are converted into a new positional vector, which is decoded to obtain the bounding box. Finally, the content vector and the sampled matrix enter cycle mixing, which consists of adaptive channel mixing and cycle spatial mixing. The mixed content vector is updated by a post feed-forward network (FFN) to produce a new content vector, and another FFN outputs the predicted categories. On the MS COCO dataset, our method reaches an AP of 44.0 after 12 epochs and 46.0 after 36 epochs of training, showing clear advantages over existing detectors in convergence speed and model simplicity. Experimental results on the Cityscapes dataset also demonstrate the superiority of the method.
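The positional update described above (offsets combined with the original positional vector, then decoded into a bounding box) can be illustrated with a minimal sketch. The abstract does not state how the positional vector is parameterized, so the sketch assumes a centre/size form `(cx, cy, w, h)` and the common delta rule in which centre offsets are scaled by the box size and size offsets are applied in log space. The names `PositionalVector`, `update_position`, and `decode_box` are hypothetical and are not taken from the paper.

```python
import math
from typing import NamedTuple, Tuple


class PositionalVector(NamedTuple):
    """Assumed parameterization: box centre and size (not confirmed by the paper)."""
    cx: float
    cy: float
    w: float
    h: float


def update_position(pos: PositionalVector,
                    offsets: Tuple[float, float, float, float]) -> PositionalVector:
    """Combine predicted offsets with the original positional vector.

    Assumption: centre offsets are relative to the current box size and
    size offsets are applied in log space, as in many box-refinement heads.
    """
    dx, dy, dw, dh = offsets
    return PositionalVector(
        cx=pos.cx + dx * pos.w,
        cy=pos.cy + dy * pos.h,
        w=pos.w * math.exp(dw),
        h=pos.h * math.exp(dh),
    )


def decode_box(pos: PositionalVector) -> Tuple[float, float, float, float]:
    """Decode a positional vector into corner form (x1, y1, x2, y2)."""
    half_w, half_h = pos.w / 2, pos.h / 2
    return (pos.cx - half_w, pos.cy - half_h, pos.cx + half_w, pos.cy + half_h)


# Usage: shift a 20x10 box centred at (50, 50) without changing its size.
old_pos = PositionalVector(cx=50.0, cy=50.0, w=20.0, h=10.0)
new_pos = update_position(old_pos, (0.5, -1.0, 0.0, 0.0))
box = decode_box(new_pos)
```

In an iterative decoder, a step like this would presumably run once per stage, with the refined positional vector passed to the next stage alongside the updated content vector.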
