A Survey of Object Detection Based on CNN and Transformer

Ershat Arkin, Kurban Ubul, Yusnur Muhtar, Nurbiya Yadikar

doi:10.1109/prml52754.2021.9520732

Abstract

The task of object detection is to find all objects of interest in an image and to determine their classes and positions; it is one of the core problems in computer vision. Since the emergence of AlexNet, convolutional neural networks have held a dominant position in computer vision, and research on convolutional neural networks and algorithm structures has become increasingly in-depth. Object detection algorithms can be roughly divided into two categories: candidate-based (two-stage) and regression-based (one-stage). Candidate-based algorithms achieve high accuracy, but their structure is complex and their detection speed is slow. Regression-based algorithms have a simple structure and fast detection speed, which gives them high application value in real-time object detection, but their detection accuracy is relatively low. In pursuit of both speed and accuracy, researchers have tried to apply mainstream methods from other fields. As a result, Transformers from the NLP field have recently been applied to computer vision, as in ViT and Swin Transformer. These works show that Transformer-based models perform comparably to or better than convolutional neural network algorithms, and they point out new paths for researchers. This paper introduces classic neural networks, discusses the advantages and disadvantages of convolutional neural networks used in object detection algorithms, and introduces the latest innovative Transformer methods in computer vision. Finally, the difficulties, challenges, and future development of convolutional neural networks and Transformers in object detection are considered.

Full Text