Abstract

The rise of artificial intelligence has produced a proliferation of deep learning models, yet comparative analyses remain scarce, particularly among computer vision models rooted in different design philosophies. This study therefore examines the strengths of several models through their structural attributes, with the aim of offering insights that can inform the development of higher-performing models in the future. We first select representative models from three different research directions, each embodying a distinct design idea, and outline the differences between them. We then measure the performance of each model experimentally on a common dataset and analyze the reasons behind the observed results. Four models, VGG16, YOLOv5, YOLOv8, and DINOv2, were deployed and tested on the Fruit 360 dataset, reaching final accuracies of 0.955, 0.997, 0.998, and 0.986, respectively. The YOLO and DINO models were markedly more accurate than the VGG model. This result may stem from the anchor boxes introduced in the YOLO models and the attention mechanism in the DINO model, both of which indirectly enlarge the receptive field used for feature extraction. YOLOv8 improves slightly on YOLOv5 in accuracy, possibly because its decoupled head reduces the influence of localization information on the classification task.
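To make the decoupled-head argument concrete, the sketch below contrasts a coupled detection head, where one shared branch predicts box offsets and class scores together, with a decoupled head that gives each task its own branch. This is a minimal illustrative sketch, not the code evaluated in the study; the channel width, layer choices, and the 131-class output (a commonly cited Fruit 360 class count) are all assumptions.

import torch
import torch.nn as nn

class CoupledHead(nn.Module):
    """One shared convolution predicts box regression and class scores
    together, so localization gradients can interfere with the features
    used for classification."""
    def __init__(self, in_ch: int = 256, num_classes: int = 131):
        super().__init__()
        # 4 box offsets + class logits from a single prediction layer
        self.pred = nn.Conv2d(in_ch, 4 + num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pred(x)

class DecoupledHead(nn.Module):
    """Separate branches for regression and classification, so each task
    learns its own features; this is the design the abstract credits for
    YOLOv8's small accuracy gain over YOLOv5."""
    def __init__(self, in_ch: int = 256, num_classes: int = 131):
        super().__init__()
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, 4, 1),            # box offsets only
        )
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_classes, 1),  # class logits only
        )

    def forward(self, x: torch.Tensor):
        return self.reg_branch(x), self.cls_branch(x)

# Quick shape check on a dummy feature map
feat = torch.randn(1, 256, 20, 20)
box, cls = DecoupledHead()(feat)
print(box.shape, cls.shape)  # (1, 4, 20, 20) and (1, 131, 20, 20)

Because the classification branch never has to encode box geometry, its features can specialize, which is one plausible mechanism for the 0.997 to 0.998 accuracy difference the abstract reports.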
