Abstract

Due to the complex ornamentation and distinctive composition of ethnic minority costumes, current costume image recognition algorithms achieve limited performance. Models based on convolutional neural networks can extract deep semantic features from clothing images and perform well on datasets with many images, but they overlook large-scale features along the spatial (height and width) dimensions. We therefore propose an improved model based on the Vision Transformer: asymmetric convolutions extract image features along the height and width directions, the resulting features are serialized and encoded by a Transformer encoder, and its output yields the recognition result. Using accuracy as the evaluation metric on a minority costume dataset, the proposed method performs better than ResNet34 and is 1.2% higher than the classic Vision Transformer.
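
The pipeline described above can be illustrated with a minimal sketch, not the authors' released code: asymmetric (1×k and k×1) convolutions capture width-wise and height-wise features, the feature maps are flattened into a token sequence, and a standard Transformer encoder produces a classification from a [CLS] token. All layer sizes, the pooling grid, and the classification head here are illustrative assumptions.

```python
# Sketch of an asymmetric-convolution + Transformer encoder classifier (assumed details).
import torch
import torch.nn as nn


class AsymConvViT(nn.Module):
    def __init__(self, in_ch=3, dim=192, kernel=7, num_classes=10, depth=4, heads=4):
        super().__init__()
        # Asymmetric convolutions: one branch emphasizes width-wise structure,
        # the other height-wise structure.
        self.conv_w = nn.Conv2d(in_ch, dim // 2, kernel_size=(1, kernel),
                                stride=(1, 4), padding=(0, kernel // 2))
        self.conv_h = nn.Conv2d(in_ch, dim // 2, kernel_size=(kernel, 1),
                                stride=(4, 1), padding=(kernel // 2, 0))
        # Reduce both branches to the same spatial grid before concatenation.
        self.pool = nn.AdaptiveAvgPool2d((14, 14))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 14 * 14 + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # Height- and width-oriented feature maps, concatenated along channels.
        fw = self.pool(self.conv_w(x))
        fh = self.pool(self.conv_h(x))
        feat = torch.cat([fw, fh], dim=1)              # (B, dim, 14, 14)
        tokens = feat.flatten(2).transpose(1, 2)       # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        out = self.encoder(tokens)                     # serialization + encoding
        return self.head(out[:, 0])                    # classify from the [CLS] token


if __name__ == "__main__":
    model = AsymConvViT(num_classes=5)
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 5])
```

In this sketch the two convolution branches replace the square patch embedding of the classic Vision Transformer; the rest of the encoder and the [CLS]-token classification follow the standard ViT recipe.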
