Abstract

Transformers are now widely used in computer vision. A growing body of literature shows that their advantages, such as a larger receptive field and global context modeling, are becoming increasingly evident in image processing. Despite this popularity, whether the transformer can match the convolutional neural network (CNN) in performance remains an open question that requires further study. This paper uses the most basic Vision Transformer (ViT) architecture to classify and recognize Chinese characters frequently used in transportation and logistics, and compares it with two classical CNN models. The results show that the transformer clearly outperforms the traditional CNN architectures, reaching a final character recognition accuracy of up to 98.66%. This demonstrates the strong potential of the transformer in computer vision, as well as its high reliability and generalization ability.
