Abstract

Recently, transformers have shown promising performance in various computer vision tasks. However, current transformer-based methods ignore the information exchange between transformer blocks, and they have not been applied to the facial attribute recognition (FAR) task. In this paper, we propose a multi-zone transformer based on self-distillation, termed MZTS, to predict facial attributes. A multi-zone transformer encoder is first presented to enable interactions among the different transformer encoder blocks, thus avoiding the loss of useful information between transformer encoder block groups during iteration. Furthermore, we introduce a new self-distillation mechanism based on class tokens, which distills the class tokens obtained from the last transformer encoder block group to the shallower groups by exchanging salient information between the different transformer blocks through attention. Extensive experiments on the challenging CelebA and LFWA datasets demonstrate the excellent performance of the proposed method for FAR.
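The abstract does not specify the distillation objective. As a rough, hypothetical illustration of class-token self-distillation, the sketch below treats the last encoder group's class token as the teacher and penalizes each shallower group's class token with a temperature-scaled KL divergence, as in standard knowledge distillation; all names and the choice of KL loss are assumptions, not the paper's method.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def class_token_distill_loss(shallow_tokens, deep_token, temperature=2.0):
    """Hypothetical self-distillation loss: average KL between the
    deepest group's class-token distribution (teacher) and each
    shallower group's class-token distribution (students)."""
    teacher = softmax(deep_token, temperature)
    losses = []
    for tok in shallow_tokens:
        student = softmax(tok, temperature)
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)

# Example with made-up 3-dimensional class-token logits:
deep = [2.0, -1.0, 0.5]
shallow = [[1.5, -0.5, 0.2], [0.8, -1.2, 0.9]]
loss = class_token_distill_loss(shallow, deep)
```

The loss is zero when a shallow token matches the deep token exactly and grows as their distributions diverge, so minimizing it pulls the shallower groups toward the representation of the last group.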
