Abstract

Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels, offering better scalability and practicality than fully supervised methods. However, without pixel-level supervision, existing methods tend to generate coarse localization maps, which limits localization performance. To alleviate this problem, we propose an adversarial transformer network (ATNet), which aims to obtain a well-learned localization model with pixel-level pseudo labels. The proposed ATNet enjoys several merits. First, we design an object transformer (G) that can generate localization maps and pseudo labels effectively and dynamically, and a part transformer (D) that accurately discriminates detailed local differences between localization maps and pseudo labels. Second, we propose to train G and D via an adversarial process, in which G learns to generate more accurate localization maps that approach the pseudo labels in order to fool D. To the best of our knowledge, this is the first work to explore transformers with adversarial training to obtain a well-learned localization model for WSOL. Extensive experiments with four backbones on two standard benchmarks demonstrate that our ATNet achieves favorable performance against state-of-the-art WSOL methods. In addition, our adversarial training provides higher robustness against adversarial attacks.
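The following is a minimal sketch of the adversarial process described above, assuming a standard GAN-style alternating update in which the discriminator D treats pseudo labels as "real" and generated localization maps as "fake". The module definitions (ObjectTransformer, PartTransformer), the pseudo-label source, and the binary cross-entropy objective are illustrative assumptions; the abstract does not specify ATNet's actual architectures or losses.

```python
# Hedged sketch of adversarial training between an object transformer G and a
# part transformer D. All architectures and losses here are placeholders.
import torch
import torch.nn as nn

class ObjectTransformer(nn.Module):      # stand-in for the generator G
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 1, kernel_size=3, padding=1)
    def forward(self, x):
        return torch.sigmoid(self.net(x))  # localization map in [0, 1]

class PartTransformer(nn.Module):        # stand-in for the discriminator D
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(8, 1))
    def forward(self, m):
        return self.net(m)               # real/fake logit for a map

G, D = ObjectTransformer(), PartTransformer()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(images, pseudo_labels):
    """One alternating update: D learns to separate pseudo labels from
    generated maps; G learns to produce maps that D accepts as pseudo labels."""
    b = images.size(0)
    # --- update D: pseudo labels are "real", generated maps are "fake" ---
    with torch.no_grad():
        fake_maps = G(images)
    d_loss = bce(D(pseudo_labels), torch.ones(b, 1)) + \
             bce(D(fake_maps), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- update G: fool D into scoring generated maps as pseudo labels ---
    maps = G(images)
    g_loss = bce(D(maps), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# usage with random tensors standing in for an image batch and pseudo labels
imgs = torch.randn(4, 3, 64, 64)
pseudo = torch.rand(4, 1, 64, 64)
print(train_step(imgs, pseudo))
```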
