Abstract

Image-recognition applications in everyday life, scientific research, and industry, such as target recognition, autonomous driving, and medical image diagnosis, rely mainly on a variety of large, high-performing models, from the original Convolutional Neural Network (CNN) to the many variants of classical models proposed since. In this paper, we take the task of classifying cat and dog image datasets as an example and compare the efficiency and accuracy of the CNN, the Vision Transformer (ViT), and the Swin Transformer side by side. We train each model for 25 epochs and record its accuracy and time consumption separately. Comparing epoch counts and wall-clock time, we find that the CNN takes the least total time, followed by the Swin Transformer, while ViT takes the most; measured in epochs, ViT converges fastest and the Swin Transformer slowest. ViT achieves the highest training accuracy, followed by the Swin Transformer, with the CNN lowest, and the validation accuracies follow the same ordering. In short, ViT is the most accurate but the slowest overall, whereas the CNN is the fastest but the least accurate; the Swin Transformer, which combines ideas from CNNs and ViT, is the most complex of the three but offers a good balance of the two. ViT in particular is a promising model that deserves further research and exploration in the computer vision field.
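The abstract gives no implementation details, so the following is only a minimal sketch of the benchmark it describes, assuming a PyTorch/timm environment, an ImageFolder-style cat/dog dataset under a hypothetical data/ directory, and representative model variants (resnet18, vit_base_patch16_224, swin_tiny_patch4_window7_224); none of these names are specified in the paper.

import time
import timm
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hypothetical dataset layout: data/train/<class>/*.jpg, data/val/<class>/*.jpg
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])
train_ds = datasets.ImageFolder("data/train", transform=tfm)
val_ds = datasets.ImageFolder("data/val", transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=32)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Representative architectures; the paper does not name exact variants.
model_names = {
    "CNN": "resnet18",
    "ViT": "vit_base_patch16_224",
    "Swin": "swin_tiny_patch4_window7_224",
}

def accuracy(model, loader):
    """Fraction of correctly classified samples in a loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total

for label, name in model_names.items():
    model = timm.create_model(name, pretrained=False, num_classes=2).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    total_train_time = 0.0
    for epoch in range(25):  # 25 epochs per model, as in the paper
        model.train()
        start = time.time()
        for x, y in train_dl:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        total_train_time += time.time() - start  # exclude evaluation from timing
        print(f"{label} epoch {epoch + 1}: "
              f"train acc {accuracy(model, train_dl):.3f}, "
              f"val acc {accuracy(model, val_dl):.3f}")
    print(f"{label}: total training time {total_train_time:.1f}s")

Timing only the training steps gives the total-time comparison, while the per-epoch accuracy log supports the epochs-to-convergence comparison the abstract reports.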
