Abstract

Fine-grained visual classification (FGVC) is an essential and challenging task in computer vision that aims to distinguish subordinate categories within a broader class, such as different species of birds or models of cars. Most recent studies combine a convolutional neural network with an attention mechanism to automatically locate discriminative regions and improve accuracy. However, the discriminative regions selected by convolutional networks tend to be coarse and overly large. The Vision Transformer divides the image into patches and relies on self-attention to select more precise discriminative regions, but it ignores the relationships between local patches before patch embedding. In addition, patches are often highly similar to one another and therefore redundant. We propose PEDTrans, a model built on the Vision Transformer. It adds a patch enhancement module based on an attention mechanism and a similarity-based module that randomly discards patches within groups of similar patches. These two modules establish local feature relationships among patches and select the patches that best discriminate between images. Combining them with the Vision Transformer backbone improves fine-grained visual classification accuracy. We evaluate on the commonly used fine-grained datasets CUB-200-2011, Stanford Cars, Stanford Dogs and NABirds and achieve state-of-the-art results.

Keywords: Fine-grained visual classification · Vision Transformer · Self-attention
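To make the similarity-based discarding idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: it scores each patch embedding by its maximum cosine similarity to the other patches and keeps the least redundant ones. The function name `drop_similar_patches`, the `keep_ratio` parameter, and the deterministic selection (the paper's module randomly discards within groups of similar patches) are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def drop_similar_patches(patches, keep_ratio=0.75):
    """Hypothetical sketch of similarity-based patch discarding.

    patches: (batch, num_patches, dim) tensor of patch embeddings.
    Each patch is scored by its maximum cosine similarity to any other
    patch (its redundancy); the least redundant patches are kept.
    """
    b, n, d = patches.shape
    normed = F.normalize(patches, dim=-1)
    sim = normed @ normed.transpose(1, 2)          # (b, n, n) cosine similarity
    sim.diagonal(dim1=1, dim2=2).fill_(-1.0)       # ignore self-similarity
    redundancy, _ = sim.max(dim=-1)                # (b, n) redundancy score
    k = max(1, int(n * keep_ratio))
    keep_idx = redundancy.topk(k, largest=False).indices  # most distinctive patches
    keep_idx, _ = keep_idx.sort(dim=-1)            # preserve spatial order
    return torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

# Example: 196 patch embeddings of dim 768 (ViT-B/16 on a 224x224 image)
x = torch.randn(2, 196, 768)
print(drop_similar_patches(x).shape)  # torch.Size([2, 147, 768])
```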
