Abstract

Object classification under partial occlusion remains challenging for deep convolutional neural networks because of the inherent locality of their feature extraction. We propose an Occlusion-aware Spatial Attention Transformer (OSAT) built on the Vision Transformer (ViT), CutMix augmentation, and an Occlusion Mask Predictor (OMP) to address the occlusion problem. ViT relies on the self-attention mechanism, which enables the model to capture spatially distant information. In addition, we combine CutMix augmentation with ViT to synthesize occluded training images. The OMP serves both as a multi-task learning objective and as a source of spatial attention over non-occluded regions. Our proposed OSAT achieves state-of-the-art performance on occluded vehicle classification datasets built from PASCAL3D+ and MS-COCO. Moreover, additional experiments show that the OMP outperforms the previous approach to occluder localization both quantitatively and qualitatively. Our ablation studies confirm that ViT is effective at analyzing occluded objects and that CutMix augmentation and the OMP yield further improvements.
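As a concrete illustration of the augmentation step, the sketch below shows the standard CutMix recipe in PyTorch: a rectangular patch from a shuffled copy of the batch is pasted onto each image, and the labels are mixed in proportion to the patch area. The function name `cutmix` and the hyperparameter `alpha` are ours for illustration; the paper's exact occlusion-aware variant and settings may differ.

```python
import torch

def cutmix(images, labels, alpha=1.0):
    """Standard CutMix sketch: paste a random box from a shuffled copy
    of the batch and return both label sets plus the mixing weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    _, _, h, w = images.shape

    # Sample a box whose area fraction is roughly (1 - lam).
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    # Overwrite the box with the corresponding region of the shuffled batch.
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Recompute lam from the clipped box so label weights match the pixels.
    lam = 1.0 - ((y2 - y1) * (x2 - x1)) / (h * w)
    return images, labels, labels[perm], lam
```

In training, the classification loss would then be mixed accordingly, e.g. `loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)`.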
